Closed (cmdcolin closed this 6 years ago)
Thomas on #bioperl provided the following tip
17:20 < trs> cdiesh: Perl version?
17:22 < trs> I don't know what the test code is doing, but if it's relying on the order of keys %hash or values %hash somewhere being stable, that'll fail on Perl >= 5.18.0
Confirmed by using perlbrew to install Perl 5.18 on Mac OS X. The tests normally pass fine with Perl 5.16 on Mac OS X.
Here's what the nclist in sample_data/json/volvox/tracks/Genes/ctgA/trackData.json looks like on a system with perl5.16
EDIT: to include the whole intervals->nclist subtree
Here's what the same nclist block looks like on a machine with perl 5.18 (note the contents of the feature data are split up)
To be clear about the source of the problem, for example:
In the perl 5.18 code, the class structure for the class order "0" is:
[1:"Start", 2:"End", 3:"Strand", 4:"Note", 5:"Load_id", 6:"Type", 7:"Source", 8:"Subfeatures", 9:"Name", 10:"Seq_id"]
and then the data structure in nclist matches this:
[1:1049, 2:9000, 3:1, 4:"protein kinase", 5:"EDEN", 6:"gene", 7:"example", 8:subfeatures, 9:"EDEN", 10: "ctgA"]
In perl 5.16 the structure of class order 0 is:
["Start", "End", "Strand", "Source", "Seq_id", "Load_id", "Name", "Note", "Type", "Subfeatures"]
and then the data structure in nclist matches this:
[1049, 9000, 1, 'example', 'ctgA', 'EDEN', 'EDEN', 'protein kinase', 'gene', subfeatures]
The test code assumes that the attribute order of class 0 matches a predefined order, but as shown above this assumption is invalid. The test code will be updated.
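One way a test could avoid depending on attribute position is to pair each value with its attribute name before comparing. This is a minimal sketch, not the actual test code: `decode_feature` is a hypothetical helper, the attribute and data lists are copied from the two examples above, and an empty array reference stands in for the subfeatures.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of JBrowse): pair each attribute name of a
# class with the value at the same position in a feature's data array.
sub decode_feature {
    my ($attributes, $data) = @_;
    my %feature;
    @feature{@$attributes} = @$data;    # hash slice: name => value
    return \%feature;
}

# Attribute order and data as listed above for Perl 5.18.
my @attrs_518 = qw(Start End Strand Note Load_id Type Source Subfeatures Name Seq_id);
my @data_518  = (1049, 9000, 1, 'protein kinase', 'EDEN', 'gene', 'example', [], 'EDEN', 'ctgA');

# Attribute order and data as listed above for Perl 5.16.
# The empty array reference is a placeholder for the subfeatures.
my @attrs_516 = qw(Start End Strand Source Seq_id Load_id Name Note Type Subfeatures);
my @data_516  = (1049, 9000, 1, 'example', 'ctgA', 'EDEN', 'EDEN', 'protein kinase', 'gene', []);

my $f518 = decode_feature(\@attrs_518, \@data_518);
my $f516 = decode_feature(\@attrs_516, \@data_516);

# Looking values up by name gives the same answer under either layout.
print "$_: $f518->{$_} / $f516->{$_}\n" for qw(Start End Note Type Seq_id);
```

A comparison written against `$f518->{Note}` rather than a fixed array index would pass on both Perl versions, since both layouts carry the same values under different positions.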
Here is the full output using Perl 5.18 on Ubuntu (large file, 127 KB). The flatfile-to-json and generate-names tests fail: http://pastebin.com/ZDt5nBm0
Note: this is an example of the problem in flatfile-to-json, where many NCList ArrayRepr classes are dynamically created with their attributes just slightly shuffled around
An unfortunate consequence of this issue is that running flatfile-to-json with Perl 5.18 or later takes much longer and produces much larger files.
This was alluded to in earlier comments here. Because the hash order is randomized, many combinatorial orderings of the same feature attributes are generated as separate classes (e.g. some features are represented by start,end,name,id,parent in trackData.json, while others are represented by name,end,start,parent,id, with the data values reordered to match).
The data still works at runtime, but this inflates the size of the files and makes the conversion take longer to run.
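The effect can be illustrated with a small simulation. This is only a sketch under assumptions: `shuffle` from List::Util stands in for the unpredictable order returned by `keys %hash` (real hash ordering is not a uniform shuffle), and the five attribute names are the ones quoted in the example above.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Attribute names quoted from the example above.
my @attributes = qw(start end name id parent);

my (%unsorted_layouts, %sorted_layouts);
for (1 .. 1000) {
    # shuffle() is a stand-in for the unstable order of "keys %hash".
    my @observed = shuffle @attributes;

    # Each distinct ordering would become its own class definition.
    $unsorted_layouts{ join ',', @observed }++;

    # Sorting first collapses every ordering into a single layout.
    $sorted_layouts{ join ',', sort @observed }++;
}

printf "distinct layouts, unsorted: %d\n", scalar keys %unsorted_layouts;
printf "distinct layouts, sorted:   %d\n", scalar keys %sorted_layouts;
```

With five attributes there are at most 5! = 120 possible orderings, so the unsorted count can approach 120 while the sorted count stays at 1. Features with more attributes have far more possible orderings.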
Here's a short example parsing a 280MB gff
Perl 5.14, takes about 5 minutes
time bin/flatfile-to-json.pl --gff file.gff --sortMem 1000000000 --trackLabel test_5_14
205.80s user 10.97s system 73% cpu 4:54.56 total
Perl 5.18, takes almost 4 hours
time bin/flatfile-to-json.pl --gff file.gff --sortMem 1000000000 --trackLabel test_5_18
13749.42s user 240.97s system 99% cpu 3:54:43.13 total
Not only that, but the size on disk is vastly larger
In the perl 5.14 data directory, the disk size is 366MB for this track. In the 5.18 instance, the disk space is 21 GB (gigabytes)
Therefore there is a roughly 67x increase in user CPU time (about 48x in wall-clock time) and a 57x increase in disk space consumption!
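The ratios can be recomputed from the `time` output and disk sizes quoted above. A minimal sketch: `to_seconds` is a hypothetical helper for this example, and the disk ratio assumes 1 GB = 1000 MB.

```perl
use strict;
use warnings;

# Hypothetical helper: convert "h:mm:ss.ss" or "m:ss.ss" to seconds.
sub to_seconds {
    my ($clock) = @_;
    my $seconds = 0;
    $seconds = $seconds * 60 + $_ for split /:/, $clock;
    return $seconds;
}

# User CPU seconds reported by `time` for Perl 5.18 vs Perl 5.14.
my $user_ratio = 13749.42 / 205.80;

# Wall-clock totals reported by `time`.
my $wall_ratio = to_seconds('3:54:43.13') / to_seconds('4:54.56');

# 21 GB vs 366 MB, assuming 1 GB = 1000 MB.
my $disk_ratio = (21 * 1000) / 366;

printf "user CPU:   %.1fx\n", $user_ratio;
printf "wall clock: %.1fx\n", $wall_ratio;
printf "disk space: %.1fx\n", $disk_ratio;
```

The user CPU ratio comes out just under 67x and the wall-clock ratio just under 48x. The gap between them follows from the 5.14 run using only 73% CPU while the 5.18 run used 99%.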
Because of this, it might be advisable to (a) add a prominent warning to use Perl versions earlier than 5.18, since 5.18 is when Perl introduced hash order randomization, and/or (b) fix this bug
This seems weird to report about only now but I think this is reproducible and sucks for the end user. Also perl 5.18 and over probably only recently became the default perl distribution on newer operating systems so more users will experience this
Possible solution: wherever the code says "keys %hash", replace it with "sort keys %hash".
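A minimal sketch of what that replacement buys, assuming a hypothetical helper `attribute_layout` (not a JBrowse function) that derives a class layout from a feature hash:

```perl
use strict;
use warnings;

# Hypothetical helper: return a feature's attribute names in a fixed order.
# "sort keys" makes the result independent of Perl's internal hash order.
sub attribute_layout {
    my ($feature) = @_;
    return [ sort keys %$feature ];
}

# Two features with the same attributes, built in different orders.
my %first  = (start => 1049, end => 9000, name => 'EDEN');
my %second = (name => 'EDEN', end => 9000, start => 1049);

my $layout_first  = join ',', @{ attribute_layout(\%first) };
my $layout_second = join ',', @{ attribute_layout(\%second) };

# Both lines print: end,name,start
print "$layout_first\n$layout_second\n";
```

Because both features map to the same layout, they can share a single class. The trade-off is that the order is alphabetical rather than meaningful, so any code that expects a particular attribute at a particular position still has to look it up by name.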
Here is a test GFF that, as I recall, demonstrated the very long run time and disk space blowup (not all GFFs seem to do this): ftp://ftp.ncbi.nlm.nih.gov/genomes/Scleropages_formosus/GFF/ref_ASM162426v1_top_level.gff3.gz
Looks like the changes you made in #912 fix that performance regression. Nicely done.
Fixed! Merged the PR. Thanks so much @cmdcolin
Results of prove -I src/perl5 -lr tests/ on Ubuntu 14 with perl 5.18.2