berman-lab / ymap

YMAP - Yeast Mapping Analysis Pipeline : An online pipeline for the analysis of yeast genomic datasets.
MIT License

Large reference genomes (>100 Mbp) crash both at the reference upload and the dataset analysis stages #58

Closed vladimirg closed 7 years ago

vladimirg commented 7 years ago

Recently a user tried uploading a reference for Phytophthora infestans. That genome is about 20 times larger than C. albicans, at around 200-250 Mbp. Worse, the current assembly has 1318 scaffolds. A few problems were noted:

  1. genome.install_1.php stalls on large genomes. This is because, to get a basic overview of the file, it converts the whole thing with FASTA_reformat_1.sh, reads it in its entirety into memory, and then converts it back with FASTA_reformat_2.sh. As PHP scripts have both memory and time limits (the defaults are ~128 MB and 30 seconds), the script is simply killed by Apache. I tried doing away with the reformatting and parsing the file using a buffer, but that's also too slow in PHP. I don't think changing the PHP limit defaults is good practice, either. However, Biopython's SeqIO does the job very well. Attached (https://github.com/berman-lab/ymap/files/588132/Archive.zip) is the version I ended up with today, which uses a Python script to generate the summary, which is then read by PHP. The Python script runs in a few seconds even for the >200 MB P. infestans reference. If we decide to drop the reformatting, we have to make sure it doesn't hurt anything downstream.
  2. 1318 scaffolds is way too many. Even after removing scaffolds <100 Kbp, almost 500 remained.
    1. Design issues aside, the current implementation has a hard limit on the number of scaffolds it can handle, as they are stored in the $_SESSION variable near the end of the genome.install_1.php script. Apparently, it's limited to 1000 data points (that limit is breached at 400 scaffolds). We could use the Python script from above to store this information in a separate file, for future scripts to read from, instead of passing it within the session. (Though strictly this is not necessary for working with just 50 scaffolds.)
    2. In the case of too many scaffolds, I wonder if we can store the results in tables, so even if visualization of the whole genome can't be done, users can still investigate LOH/aneuploidy using them.
  3. Even when I used the 50 longest scaffolds (>80 Mbp combined) I had the following crash:
Error using axes
While setting the 'Position' property of 'Axes':
Value must be numeric

Error in subplot (line 490)
        ax = axes('Units', 'normalized', 'Position', position, ...

Error in CNV_v6_6 (line 755)
        subplot('Position',[left bottom width height]);

Error in analyze_CNVs_1 (line 25)
CNV_v6_6(main_dir,user,genomeUser,project,genome,ploidyEstimateString,ploidyBaseString,CNV_verString,rDNA_verString,displayBREAKS,referenceCHR);

Error in processing1 (line 4)
    analyze_CNVs_1('/Users/bermanlab/dev/ymap/scripts_seqModules/scripts_WGseq/../../','testuser','testuser','infestans_small','infestans-test-1','2.0','2.0'); 

It seems that something went wrong in the generation of the figure definitions, though I didn't have time to investigate further.

The actionable goal of this issue is to make sure we can work with the longest 50 scaffolds of this organism. We can later consider what to do with the rest. @darrenabbey , do you have any ideas about how to handle such genomes?

darrenabbey commented 7 years ago

I do have some ideas. Right now the chromosome arrangement for the figures is determined by a very simple algorithm, which generates a text-file definition that later code sections refer to. This was done so that a module for producing more intelligent figure definitions could be implemented later, while still using the same text format for the subsequent steps.

I've been thinking of writing a graphical interface for defining which chromosomes to use and where they should be placed in a figure. I don't have any code for this, just concepts.


The memory-hungry step of ensuring the genome sequence file is properly ordered is only really necessary because the figure arrangement code is so simple. An improvement that determines contig lengths without loading them entirely into memory could then place them into the figure according to their size, rather than their order in a sorted FASTA file.


ghost commented 7 years ago

A few updates on the problem:

  1. The main problem was that genome.install_1.php loaded the entire FASTA file into memory in order to extract the chromosome names and their sizes; with the Phytophthora infestans FASTA file, the script failed to load the whole thing.
  2. The reformatting scripts FASTA_reformat_1.sh and FASTA_reformat_2.sh work well and are fast, so I don't see any need to change them (the reformatting is done in Perl with regular expressions, which is an efficient approach).
  3. Running with all 1318 scaffolds does crash, since there is a limit of up to 1000 input variables (including array members). However, I don't think this case needs to be handled: displaying so many chromosomes/scaffolds would not be feasible anyway (the output figure would have to be significantly larger, which causes all sorts of problems in MATLAB).

Two changes that I have made seem to solve the problem:

  1. I have changed the reading of the FASTA file during genome processing to be done buffer by buffer, using a buffer of 4096 chars. This way the process doesn't crash, uses less memory, and actually runs faster.
  2. While making that change I initially introduced a mistake that caused one of the chromosome sizes to be zero, which crashed the pipeline. I have therefore added an extra check that avoids trying to draw a zero-size chromosome (this should never happen, and in fact didn't happen here either, but since I encountered it I decided that adding this precaution was worthwhile).

Currently I have managed to run the dataset with the longest scaffolds, and later I will try with the 100-250 longest scaffolds. I expect the only remaining problem to be the limit on input variables (probably around 250 scaffolds will be the maximum possible, since there are internal PHP variables and around 3-4 arrays that hold a cell for each chromosome).

@vladimirg regarding the problem you experienced in point 3, I didn't manage to reproduce it, but it seems that invalid creation of the genome files caused the problem.

vladimirg commented 7 years ago

Merged into master. The end result is as follows: we keep all the chromosomes, but only let the user choose at most 50 from the 300 longest ones. Bowtie will align against the original FASTA, but the downstream tools will only analyze those chromosomes that the user selected.
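The merged selection rule could be enforced with something like the following sketch (hypothetical helper names and constants; the actual implementation lives in the PHP interface): rank scaffolds by length, offer only the 300 longest, and cap the user's selection at 50.

```python
MAX_LISTED = 300   # scaffolds offered to the user
MAX_SELECTED = 50  # hard cap on the user's selection

def selectable_scaffolds(lengths):
    """Return the names of the longest scaffolds the user may pick from,
    given a {name: length} mapping for the whole assembly."""
    ranked = sorted(lengths, key=lengths.get, reverse=True)
    return ranked[:MAX_LISTED]

def validate_selection(selection, selectable):
    """Enforce the at-most-50, from-the-300-longest rule."""
    if len(selection) > MAX_SELECTED:
        raise ValueError("at most %d scaffolds may be selected" % MAX_SELECTED)
    unknown = set(selection) - set(selectable)
    if unknown:
        raise ValueError("not among the longest scaffolds: %s" % sorted(unknown))
    return list(selection)
```

Keeping the full FASTA for the Bowtie alignment while restricting only the downstream analysis to the selected scaffolds means no reads are lost at the mapping stage; the filtering happens purely at the analysis/visualization level.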