Yes, I have always kept compressed large files in GitHub. That is one way of doing it. Note that we do not need to go through GitHub in all cases; we could just as well place the zipped files anywhere else. Having a GitHub repo to keep all the conversion scripts and metadata is still useful, I believe.
Files larger than 100 MB are not accepted at all.
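As a quick check before pushing (a hedged sketch; run from the repository root), something like this lists any files over that hard limit:

    find . -type f -size +100M -exec ls -lh {} ';'   # flag files above GitHub's 100 MB limit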
If we used https://git-lfs.github.com/ for large text files, I am not sure whether GitHub includes them in the main archive download. If it does, that would work; if not, then no.
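For what it's worth, tracking the TSVs with LFS only takes a couple of commands (a sketch; the *.tsv pattern and file name are examples, and whether GitHub's archive downloads include the LFS content would still need to be verified):

    git lfs install                  # one-time setup per machine
    git lfs track "*.tsv"            # records the pattern in .gitattributes
    git add .gitattributes Name.tsv
    git commit -m "track large TSV files with LFS"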
Let's just use zip files in the repo; that is proven to work.
@mdoering, I tried to upload a tar.gz of tsv.gz files and, at least currently, it doesn't work. The import fails with:
org.col.importer.NormalizationFailedException$SourceInvalidException: No data files found in /home/col/bin/dev/scratch/2066/source
at org.col.csv.CsvReader.validate(CsvReader.java:124)
at org.col.importer.coldp.ColdpReader.validate(ColdpReader.java:93)
at org.col.csv.CsvReader.<init>(CsvReader.java:92)
at org.col.importer.coldp.ColdpReader.<init>(ColdpReader.java:44)
at org.col.importer.coldp.ColdpReader.from(ColdpReader.java:66)
at org.col.importer.coldp.ColdpInserter.<init>(ColdpInserter.java:49)
at org.col.importer.Normalizer.insertData(Normalizer.java:800)
at org.col.importer.Normalizer.call(Normalizer.java:75)
at org.col.importer.ImportJob.importDataset(ImportJob.java:206)
at org.col.importer.ImportJob.run(ImportJob.java:122)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Ah, you tried to gzip each data file individually. Yes, that does not work. You need to zip up the entire archive and point the normalizer at the raw archive file.
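For illustration, something like this zips the complete ColDP directory into a single archive (the file names are hypothetical examples, not the dataset's actual contents):

    cd source/
    zip -r ../coldp.zip metadata.yaml Name.tsv Taxon.tsv   # one archive, not per-file gzips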
With large datasets, GitHub will reject one large 100 MB .tar.gz of text files, but it might be less likely that we'd hit the 100 MB threshold for individually gzipped files (e.g. Name.tsv.gz, Taxon.tsv.gz). I don't really like the idea of gzipping the TSVs anyway, though, because it breaks git diff, so I'll try to find another solution.
I have placed IPNI, for example, here: https://github.com/mdoering/ipni
With the new version of World Plants, I'm getting warnings from GitHub about file sizes:
I could look into using GitHub LFS, or maybe we should gzip the text files? The downside of gzipping text files is not being able to use diff, but diff doesn't work in the web interface for large files anyway. I made an experimental git repo to test gz diffing, running:
git diff 0bb462e c7bf613
git diff --text 0bb462e c7bf613
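One possible workaround, as a sketch rather than anything settled here: git can decompress .gz files before diffing via a textconv filter, which restores readable diffs locally even if the web interface still treats the files as binary:

    # in .gitattributes: route gzipped TSVs through a custom diff driver
    *.tsv.gz diff=gzip
    # in each clone: define the driver to decompress before comparing
    git config diff.gzip.textconv "gzip -dc"
    git diff 0bb462e c7bf613   # now diffs the decompressed text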