Loading multiple data files

NoonanM commented 6 years ago

Some users might have their data spread across multiple CSV files, but the app currently allows only one file to be loaded at a time. It would be useful to have the option to load more than one file.

xhdong-umd commented 6 years ago

I can make it accept multiple files and merge them into a single file for processing. I'm a little worried, though, that the columns may not match across these files, in which case a simple merge could run into problems.

For example, some files may have extra columns that are not available in the other files. Merging them would create empty columns for the other data, and I'm not sure whether empty columns will cause problems in some calculations.
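
To make the concern concrete, here is a minimal sketch in base R of what such a merge would look like. `bind_fill` is a hypothetical helper written only for this illustration (it is not part of ctmm or ctmmweb), and the two data frames are made-up rows, not real tracking data.

```r
# Hypothetical helper: row-bind data frames whose columns differ,
# padding the columns a file lacks with NA.
bind_fill <- function(dfs) {
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(df) {
    for (col in setdiff(all_cols, names(df))) {
      df[[col]] <- NA            # add the missing column, filled with NA
    }
    df[all_cols]                 # same column order in every piece
  })
  do.call(rbind, padded)
}

# Made-up example rows: the second "file" has no speed column
a <- data.frame(id = "A", longitude = 31.7, latitude = -24.2, speed = 0.4)
b <- data.frame(id = "B", longitude = 7.9, latitude = 54.2)

merged <- bind_fill(list(a, b))
print(merged)
```

The rows from the second data frame end up with `NA` in `speed`, which is exactly the kind of empty column that later calculations would have to tolerate.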

xhdong-umd commented 6 years ago

This should actually belong to the data preparation step, if we decide to have one. The user could preview the merge, spot problems, and possibly select different combinations of files to avoid them.

chfleming commented 6 years ago

Can you run as.telemetry on the individual files and merge them afterwards?

chfleming commented 6 years ago

I just realized that wouldn't keep them in the same projection. as.telemetry should be able to handle individuals with NA columns, though.

xhdong-umd commented 6 years ago

as.telemetry can work on each individual file, and the resulting telemetry objects are independent of each other.

The data frame/data.table used in my app needs to merge all animals into one table. That requires the same columns in every file, or adding empty columns for files with fewer columns. The app keeps the telemetry objects for all calculations that need telemetry input, but that means the data in the merged table can differ from the telemetry objects (mainly by some empty columns).

Merging them into a single csv before importing with as.telemetry should at least make the data consistent between the telemetry objects and the data.table. It should also keep them in the same projection.

However, I'm not sure these files can be merged directly without problems, e.g. if some file has a different format (such as the timestamp format). I think having this feature may make users think they can combine any files, which could include files with very different formats, even different column names.

xhdong-umd commented 6 years ago

For now it's possible to implement the feature for the simplest case:

- All files are in the same format, i.e. with the same columns.
- I may also need to consider zip files. Previously a zip file containing a single file could be uploaded; now I'll try to accept multiple zip files, or one zip containing multiple files.

Any inconsistency in format/columns cannot be handled automatically, and we have to ask the user to resolve it first. Hopefully the format is consistent in most cases.
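
As a rough sketch of the zip handling described above, the uploads could be flattened into a plain vector of data file paths before import. `expand_uploads` is a hypothetical name used only here, and this is not the code in the app; it relies only on `utils::unzip` from base R.

```r
# Hypothetical helper: turn a mix of uploaded paths (csv and zip) into a
# flat character vector of data file paths.
expand_uploads <- function(paths, exdir = tempfile("upload_")) {
  dir.create(exdir, showWarnings = FALSE)
  unlist(lapply(seq_along(paths), function(i) {
    p <- paths[i]
    if (grepl("\\.zip$", p, ignore.case = TRUE)) {
      # extract each archive into its own subfolder so files with the
      # same name in different archives do not overwrite each other
      utils::unzip(p, exdir = file.path(exdir, paste0("zip_", i)))
    } else {
      p                          # plain files are passed through unchanged
    }
  }))
}

# Usage with a plain csv; a zip path would be replaced by its contents
csv <- tempfile(fileext = ".csv")
writeLines("id,longitude,latitude", csv)
data_files <- expand_uploads(csv)
print(data_files)
```

Whether the extracted files then share one format still has to be checked separately, as noted above.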

xhdong-umd commented 6 years ago

@chfleming @NoonanM Do we have some sample data files that are representative of the common cases? And do we want to load multiple data files that come from different animals/formats?

One approach to resolving column name mismatches across files:

- Use as.telemetry to import each file, merge the time, long, lat columns of each telemetry object into a single data frame (with an identity column added), then import that with as.telemetry again.
- This ensures the various possible column names are converted to consistent names, so the merge will be clean. The final projection should also be the same for all.
- There will be some duplicated processing, but this is an easy hack unless we abstract the data cleaning and column detection part of as.telemetry into a function.

chfleming commented 6 years ago

That approach could drop optional columns that are imported by as.telemetry, such as errors, velocities, etc. The following will work better, but is slightly inefficient because projection happens twice:

1. Import the individual files. They will be in different, local projections.
2. Make a data.frame of all longitudes & latitudes and feed that into ctmm:::suggest.projection(data.frame). The output will be a projection string centered on and oriented to the data.
3. You can then reproject all data to this one projection via projection(telemetry_object) <- projection_string on the telemetry objects.

xhdong-umd commented 6 years ago

@chfleming @NoonanM Do you have some sample data files that can be used to test this use case?

xhdong-umd commented 6 years ago

I'm testing with some data and found this error.

# change to the paths of the downloaded files
library(ctmm)
files <- c("/Users/xhdong/Projects/ctmm-Shiny/data/buffalo/Kruger African Buffalo, GPS tracking, South Africa.csv",
           "/Users/xhdong/Projects/ctmm-Shiny/data/gulls/FTZ_ Foraging in lesser black-backed gulls (data from Garthe et al. 2016).csv")

# import each file (a multi-animal file yields a list of telemetry objects)
tele_list_list <- lapply(files, as.telemetry)
# flatten one level, so the list items are named by animal
tele_list <- unlist(tele_list_list, recursive = FALSE)

# collect the longitude/latitude of every animal into one table
df_list <- lapply(tele_list, 
                  function(tele) { tele[c("longitude", "latitude")] })
dt <- data.table::rbindlist(df_list)
proj_suggested <- ctmm:::suggest.projection(dt)
# reproject every animal to the suggested projection
tele_list <- lapply(tele_list, function(tele) {
  ctmm::projection(tele) <- proj_suggested
  return(tele)
})
# the error occurs at item No. 10
projection(tele_list[[10]]) <- proj_suggested

Error in `[<-.data.frame`(x3, i, ..., value = value) : 
  replacement has 0 items, need 11928

chfleming commented 6 years ago

1. I didn't have an rbindlist method and didn't see one in ctmmweb, so I just ran do.call on rbind here.
2. I realized that applying projection()<- to a list of objects would be useful, so I implemented that.
3. This individual (4th for me) had a row with speed but NA heading, which my code wasn't prepared for. The number of satellites was also NA for that row, so it looks like an incomplete or corrupted measurement. as.telemetry now has an option rm.na that determines how incomplete measurements are handled: either the row is deleted or the column is deleted. The default is the row. This is a kind of device failure, so I don't know what best practice would be, but here rm.na="row" seems to make the most sense. I also added code to make sure that the information regarding the velocity vector and error ellipses is complete.

I am running a check on the code now and will push to GitHub when it finishes.

xhdong-umd commented 6 years ago

@chfleming I assume the code is already finished and on GitHub now?

chfleming commented 6 years ago

Yes, sorry.

xhdong-umd commented 6 years ago

When we import multiple files, the result is usually a list of telemetry objects, named by animal id.

What should we do when there are duplicate animal ids from different files? The app assumes all animals have unique names (this is not a problem when you import a single file). Should data from different files with the same animal id be combined into a single telemetry object?

I think we at least need to give a warning message about this.

chfleming commented 6 years ago

I would run the names through make.unique
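
For reference, this is how base R's `make.unique` treats repeated names. The ids below are only illustrative, borrowed from the buffalo data mentioned in this thread.

```r
# Animal ids as they might come back from two imported files
ids <- c("Cilla", "Gabs", "Cilla", "Toni", "Gabs")

# make.unique() appends .1, .2, ... to the repeated entries only
unique_ids <- make.unique(ids)
print(unique_ids)
# "Cilla" "Gabs" "Cilla.1" "Toni" "Gabs.1"
```

The first occurrence keeps its original name, so which file gets the suffix depends on the import order.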

xhdong-umd commented 6 years ago

OK. That's a good idea.

Is there any use case where multiple data files covering different time periods of the same animal are uploaded?

chfleming commented 6 years ago

That could happen.

xhdong-umd commented 6 years ago

In that case each file's data will generate a separate telemetry object, with the animal id varied to keep it unique. This may not be optimal, but it's difficult to distinguish two cases:

A. valid same name animal across files
B. accidental name conflicts across files

I think we can only proceed under one of these two assumptions, and provide some warning messages.

Alternatively, we can add an option to treat name conflicts under assumption A or B, if both are quite likely.

xhdong-umd commented 6 years ago

I implemented importing of multiple files in the app.

- Uploading multiple files is only supported in newer browsers, like Chrome.
- If duplicate names are found, there will be warnings.
- For now duplicate names are varied to make them unique. If users have valid multiple files with the same animal name, they can always merge the files first. If this is a common case, we can try to add an option to merge them, though that could become quite complicated with more corner cases (should duplicated data be removed? what if there are conflicts, like different data in the same time period?). Maybe treating them as unique individuals is actually not a bad thing? For example, they may have quite different home ranges.
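
A minimal sketch of the duplicate-name warning, assuming the animal ids have already been collected per file. `find_conflicts` and the file names are hypothetical and only illustrate the idea; this is not the code in the app.

```r
# Hypothetical helper: report which animal ids appear more than once
# across the imported files, and in which files they occur.
find_conflicts <- function(ids_by_file) {
  long <- data.frame(
    file = rep(names(ids_by_file), lengths(ids_by_file)),
    id   = unlist(ids_by_file, use.names = FALSE),
    stringsAsFactors = FALSE
  )
  dup_ids <- unique(long$id[duplicated(long$id)])
  if (length(dup_ids) > 0) {
    warning("Duplicate animal ids across files: ",
            paste(dup_ids, collapse = ", "), call. = FALSE)
  }
  long[long$id %in% dup_ids, ]   # rows involved in a name conflict
}

# Made-up file names and ids
ids_by_file <- list(
  "deployment_1.csv" = c("Cilla", "Gabs"),
  "deployment_2.csv" = c("Gabs", "Toni")
)
conflicts <- find_conflicts(ids_by_file)
print(conflicts)
```

Listing the files next to each conflicting id should help the user decide whether it is the same animal or an accidental name clash.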

chfleming commented 6 years ago

Multiple datasets on the same individual are going to be cases where there were multiple collar deployments. They might differ in data quality, in which case their errors might need to be calibrated separately before merging, but they shouldn't overlap in time.
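
If a merge option is added later, the "shouldn't overlap in time" expectation could be checked before merging. The sketch below uses a hypothetical helper `ranges_overlap` and made-up timestamps; it only compares the overall time ranges of two deployments.

```r
# Hypothetical check: do the time ranges of two deployments overlap?
ranges_overlap <- function(t1, t2) {
  r1 <- range(t1)
  r2 <- range(t2)
  # two intervals overlap when each one starts before the other ends
  r1[1] <= r2[2] && r2[1] <= r1[2]
}

# Made-up start/end times for two collar deployments on one animal
d1 <- as.POSIXct(c("2005-07-14 05:35:00", "2005-10-12 22:30:00"), tz = "UTC")
d2 <- as.POSIXct(c("2006-01-03 08:00:00", "2006-04-20 17:15:00"), tz = "UTC")

print(ranges_overlap(d1, d2))
```

An overlap would suggest either an accidental name conflict or duplicated data, both of which need the user's attention before any merge.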

xhdong-umd commented 6 years ago

So even if we do merge them, it's not a simple task, considering the errors.

Maybe we should put this task in the data preparation step. Is the current treatment of varying the names acceptable for now?

chfleming commented 6 years ago

Yes. We should leave the option to merge for after uere.

xhdong-umd commented 6 years ago

@chfleming Is ctmm:::suggest.projection(data.frame) no longer available in the newest version of ctmm? I didn't find this function under a different name.

chfleming commented 6 years ago

@xhdong-umd Using low-level functions is no longer necessary as of yesterday's update. I sent an email but forgot to point that out. See the example here: https://ctmm-initiative.github.io/ctmm/reference/projection.html

xhdong-umd commented 6 years ago

So instead of the previous code that took a long/lat data.frame, I just use median(buffalo,k=2) on the list of telemetry objects with different projections to get the new projection?

The help for projection says median returns the median of a telemetry object. It actually also works on a list of telemetry objects, right?

xhdong-umd commented 6 years ago

I hit this error when importing a file:

> as.telemetry("/Users/xhdong/Projects/ctmm-Shiny/data/buffalo/Kruger African Buffalo, GPS tracking, South Africa.csv.zip")
Minimum sampling interval of 3 minutes in Cilla
Minimum sampling interval of 0 seconds in Gabs
Minimum sampling interval of 2 minutes in Mvubu
Minimum sampling interval of 0 seconds in Pepper
Minimum sampling interval of 0 seconds in Queen
Minimum sampling interval of 5 minutes in Toni
Error in rbind(proj)[, c("longitude", "latitude")] : 
  subscript out of bounds

chfleming commented 6 years ago

Yes to the first questions, and I'm looking into the import bug.

chfleming commented 6 years ago

Should be fixed now.

xhdong-umd commented 6 years ago

Yes, I verified it and updated the app to use the new median function instead of the low-level function.
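
For anyone reading this later, the replacement for the low-level call looks roughly like the sketch below. It assumes a ctmm version that includes the update discussed above, and it uses the buffalo example data that ships with ctmm instead of the files from this thread. The guard just skips the example when ctmm is not installed.

```r
# Sketch only: needs the ctmm package; skipped when it is not installed.
if (requireNamespace("ctmm", quietly = TRUE)) {
  library(ctmm)
  data(buffalo)  # example list of telemetry objects shipped with ctmm

  # median() of the whole list supplies the reference for one shared
  # projection (k = 2 as used in this thread); projection()<- applies
  # it to every telemetry object in the list
  projection(buffalo) <- median(buffalo, k = 2)

  # every individual should now report the same projection
  same <- identical(projection(buffalo[[1]]), projection(buffalo[[2]]))
  print(same)
}
```

With a list built from several imported files, the same two calls would replace the long/lat data.frame and ctmm:::suggest.projection steps from the earlier code.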

ctmm-initiative / ctmmweb

Loading multiple data files #58