CatalogueOfLife / backend

Complete backend of COL ChecklistBank
Apache License 2.0

Best way to handle large datasets in GitHub repos? #544

Closed: gdower closed this issue 4 years ago

gdower commented 4 years ago

With the new version of World Plants, I'm getting warnings from GitHub about file sizes:

Counting objects: 20, done.
Delta compression using up to 12 threads.
Compressing objects: 100% (19/19), done.
Writing objects: 100% (20/20), 30.19 MiB | 2.09 MiB/s, done.
Total 20 (delta 15), reused 0 (delta 0)
remote: Resolving deltas: 100% (15/15), completed with 11 local objects.
remote: warning: GH001: Large files detected. You may want to try Git Large File Storage - https://git-lfs.github.com.
remote: warning: See http://git.io/iEPt8g for more information.
remote: warning: File coldp/Name.tsv is 80.76 MB; this is larger than GitHub's recommended maximum file size of 50.00 MB
remote: warning: File raw/world_plants.psv is 62.30 MB; this is larger than GitHub's recommended maximum file size of 50.00 MB
To github.com:Sp2000/data-world-plants.git
   b14ba64..fdec51f  CoLDP -> CoLDP

I could look into using GitHub LFS, or maybe we should gzip the text files? The downside of gzipping text files might be not being able to use diff, but diff doesn't work in the web interface for large files anyway. I made an experimental git repo to test gz diffing and it results in:

git diff 0bb462e c7bf613

diff --git a/test.txt.gz b/test.txt.gz
index b67e52e..251b5d9 100644
Binary files a/test.txt.gz and b/test.txt.gz differ

git diff --text 0bb462e c7bf613

diff --git a/test.txt.gz b/test.txt.gz
index b67e52e..251b5d9 100644
--- a/test.txt.gz
+++ b/test.txt.gz
@@ -1 +1 @@
-^_<8B>^H^H|<EA><C2>]^@^Ctest.txt^@^KI-.<C9><CC>KW(<CF>H-<C9>H-RH<C9>LKSHÑ0ҫ2^K^T<D2>2sR<8B><F5><F4><F4>^Tr<F3>sS<F3>J^T<F2><D3>^TJ<8A>JK2<F4><B8>^@J<8C><B4><BB>=^@^@^@
\ No newline at end of file
+^_<8B>^H^HM<EA><C2>]^@^Ctest.txt^@^KI-.<C9><CC>KW(<CF>H-<C9>H-RH<C9>LKSHÑ0ҫ2^K^T<D2>2sR<8B><F5><B8>^@fh1^B*^@^@^@
\ No newline at end of file
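
For what it's worth, local git can be told to decompress gzipped blobs before diffing via a textconv filter, so diffs of .gz files stay readable on the command line (this does not help GitHub's web view). A minimal sketch, assuming a repo-local setup; the driver name "gzip" is arbitrary:

echo '*.gz diff=gzip' >> .gitattributes      # route *.gz files through a custom diff driver
git config diff.gzip.textconv 'gzip -dc'     # decompress blobs to text before diffing
git add .gitattributes
git diff 0bb462e c7bf613 -- test.txt.gz      # now shows the decompressed content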
mdoering commented 4 years ago

Yes, I have always kept large files compressed in GitHub; that is one way of doing it. Note that we do not need to go through GitHub in all cases: we could just as well place the zipped files anywhere else. Having a GitHub repo to keep all the conversion scripts and metadata is still useful, I believe.

mdoering commented 4 years ago

Files larger than 100 MB are not accepted by GitHub at all.

mdoering commented 4 years ago

If we used https://git-lfs.github.com/ for large text files, I am not sure whether GitHub would still include them in the repository's main archive download. If it does, that would work; if not, then no.
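
For reference, tracking the oversized files with LFS would look roughly like the sketch below (untested here; the open question above is whether LFS-tracked files still end up in the source archive download):

git lfs install
git lfs track "coldp/Name.tsv" "raw/world_plants.psv"    # patterns are recorded in .gitattributes
git add .gitattributes coldp/Name.tsv raw/world_plants.psv
git commit -m "Track large data files with Git LFS"
git push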

mdoering commented 4 years ago

Let's just use zip files in the repo; that is proven to work.

gdower commented 4 years ago

@mdoering, I tried to upload a tar.gz of tsv.gz files and, at least currently, it doesn't work. The importer fails with:

org.col.importer.NormalizationFailedException$SourceInvalidException: No data files found in /home/col/bin/dev/scratch/2066/source
    at org.col.csv.CsvReader.validate(CsvReader.java:124)
    at org.col.importer.coldp.ColdpReader.validate(ColdpReader.java:93)
    at org.col.csv.CsvReader.<init>(CsvReader.java:92)
    at org.col.importer.coldp.ColdpReader.<init>(ColdpReader.java:44)
    at org.col.importer.coldp.ColdpReader.from(ColdpReader.java:66)
    at org.col.importer.coldp.ColdpInserter.<init>(ColdpInserter.java:49)
    at org.col.importer.Normalizer.insertData(Normalizer.java:800)
    at org.col.importer.Normalizer.call(Normalizer.java:75)
    at org.col.importer.ImportJob.importDataset(ImportJob.java:206)
    at org.col.importer.ImportJob.run(ImportJob.java:122)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

A tar.gz archive with uncompressed Name.tsv does work.

mdoering commented 4 years ago

Ah, you tried to compress each data file individually. Yes, that does not work. You need to zip up the entire archive and point the normalizer at the raw archive file.
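
In other words, roughly (a sketch with illustrative archive and directory names; the entries inside the archive must stay as plain .tsv files):

tar -czf world_plants-coldp.tar.gz -C coldp .    # one compressed archive, uncompressed entries: works
# what failed above was the opposite: gzipping each data file first
# (Name.tsv.gz, Taxon.tsv.gz, ...) and then archiving those .gz files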

gdower commented 4 years ago

With large datasets, GitHub will reject a single .tar.gz of text files once it exceeds 100 MB, but we'd be less likely to hit that 100 MB threshold with individually gzipped files (e.g., Name.tsv.gz, Taxon.tsv.gz). I don't really like gzipping the tsv's anyway because it messes up git diff, so I'll try to find another solution.

mdoering commented 4 years ago

I have placed IPNI, for example, here: https://github.com/mdoering/ipni