KentonWhite / ProjectTemplate

A template utility for R projects that provides a skeletal project.
http://projecttemplate.net
GNU General Public License v3.0

false cache entry from autoloading data #288

Open alsmnn opened 5 years ago

alsmnn commented 5 years ago

Report an Issue / Request a Feature

Autoloading data creates a cache entry with the name data instead of the name of the dataset

Expected Behavior

load.project() creates a cache entry for every file in data/ with the corresponding name of the file in data/

Current Behavior

load.project() creates a cache entry with the name data, ignoring the original name of the file in data/

Steps to Reproduce Behavior

load.project() with a file in data/

Screenshots

[Image: grafik]

Version Information
          Package           Version 
"ProjectTemplate"           "0.8.2" 

R version 3.5.1

Possible Solution

-/-

Best regards, @AljoLe

Hugovdberg commented 5 years ago

What type of file are you trying to load? The .ACC is not supported, and should normally not be printed. The cached name is determined by detecting which new variables are created by the reader. So we need to know which reader is causing this. Could you post the complete filename that's causing this issue?

alsmnn commented 5 years ago

Hi @Hugovdberg, the name of the file is tcga.ACC.RData and list.data() shows:

> list.data()
               filename  varname is_ignored is_directory is_cached cache_only       reader
              README.md               FALSE        FALSE     FALSE      FALSE             
tcga.ACC tcga.ACC.RData tcga.ACC      FALSE        FALSE     FALSE      FALSE rdata.reader

I already tried tcga_ACC.RData and tcga-ACC.RData, but PT converts both to tcga.ACC anyway.

Best regards, @AljoLe

Hugovdberg commented 5 years ago

Ah, now I see what's going on. ProjectTemplate initially derives the variable name from the filename. However, .RData files are simply loaded into the global environment, so the objects keep the names they were saved under and this initial variable name is ignored. Apparently your tcga.ACC.RData contains a variable called data, and that is therefore the name used for caching.

After loading the data, any new variables in the global environment are cached by their actual names. Several readers alter the variable names during loading (e.g., all sheets are read from an Excel file, and each sheet name is appended to the initial filename). Variables are therefore not cached by the original filename, which unfortunately breaks the link between the file and the cache.
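
To illustrate the mechanism, here is a minimal base R sketch. It is not ProjectTemplate's actual code: the scratch environment, the `initial_name` derivation, and the sample data frame are assumptions made for the example. It only shows that `load()` restores objects under the names they were saved with, so a before/after comparison of the environment finds `data` rather than the name taken from the filename.

```r
# Sketch only: mimics the "detect new variables" idea with base R.
path <- file.path(tempdir(), "tcga.ACC.RData")

# Create a file whose only object is called `data` (sample content).
src <- new.env()
assign("data", data.frame(id = 1:3), envir = src)
save(list = "data", envir = src, file = path)

# Name derived from the filename (hypothetical stand-in for PT's logic).
initial_name <- sub("\\.RData$", "", basename(path))

# Load into a scratch environment and compare the variable listing
# before and after, which is how new variables can be detected.
env <- new.env()
before <- ls(env)
load(path, envir = env)
new_vars <- setdiff(ls(env), before)

print(initial_name)  # [1] "tcga.ACC"
print(new_vars)      # [1] "data"
```

The filename-derived name and the detected name differ, which matches the cache entry called data reported above.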

There is currently no solution to this problem short of rewriting the data loading system. That would require a file info reader for each file type that reports the exact variable names the corresponding reader would create. I've done some work on such a framework, but currently don't have the time to continue this major overhaul.
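
For .RData files specifically, such an info reader could look roughly like the sketch below. `rdata_varnames()` is a hypothetical helper, not part of ProjectTemplate; it relies only on the documented behaviour that base R's `load()` invisibly returns the names of the objects it restored.

```r
# Hypothetical helper (not a ProjectTemplate function): report which
# variables an .RData file would create, without touching the global
# environment.
rdata_varnames <- function(path) {
  scratch <- new.env(parent = emptyenv())
  # load() returns the restored object names as a character vector.
  vars <- load(path, envir = scratch)
  vars
}

# Usage with a throwaway file holding two sample objects.
path <- file.path(tempdir(), "example.RData")
src <- new.env()
assign("data", 1:3, envir = src)
assign("meta", "demo", envir = src)
save(list = c("data", "meta"), envir = src, file = path)

vars <- rdata_varnames(path)
print(sort(vars))
```

Note that this still reads the whole file into memory just to learn the names, so it would be slow for large datasets; other file types would each need their own equivalent.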