aljpetri / isONform

De novo construction of isoforms from long-read data
GNU General Public License v3.0

suggestions with big data #16

Open alexyfyf opened 11 months ago

alexyfyf commented 11 months ago

Hi Alex,

I found that your tool generates a lot of intermediate files (as do isONclust and isONcorrect), and they use up my inodes quickly. Any suggestions on how to alleviate this for big datasets? Would increasing (or decreasing) --max_seqs or --max_seqs_to_spoa help?

Thank you so much. Cheers,

alexyfyf commented 10 months ago

Also, I noticed that in your pipeline you set isONclust --k 8 --w 9 rather than the default --k 13 --w 20 for ONT data, which also slows down the clustering step a lot. Any reason for choosing that?

aljpetri commented 10 months ago

Hi, thank you very much again for reporting your findings.

Also, I noticed that in your pipeline you set isONclust --k 8 --w 9 rather than the default --k 13 --w 20 for ONT data, which also slows down the clustering step a lot. Any reason for choosing that?

I have fixed this in commit 2f40387 and also renamed the run_mode from analysis to ont to make it clearer what the mode is used for. We used those k and w values in our analyses to limit any possible impact of isONclust on the final results, but they are not recommended for ONT data sets.

Any suggestions on how to alleviate this for big datasets?

If you are referring to the number of clusters (isONclust and isONcorrect), one thing you could try is setting a higher value for iso_abundance when running the pipeline. This requires more reads for a cluster to be formed (in isONclust and isONcorrect) and more reads supporting an isoform for it to be called, which should reduce the number of clusters. The trade-off is that some isoforms with very low read support might not be called. If this is not what you meant, could you explain a bit more?

Best, Alex
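To make the trade-off concrete, here is a toy sketch of how a minimum-read threshold shrinks the set of clusters that survive. It is not taken from isONform, isONclust or isONcorrect; the function name, the read-to-cluster mapping and the threshold values are all hypothetical and only illustrate the general effect described above.

```python
from collections import Counter


def clusters_kept(cluster_of_read, min_reads):
    """Return the ids of clusters that contain at least `min_reads` reads.

    `cluster_of_read` maps a read name to a cluster id (hypothetical input,
    not the pipeline's real data structure).
    """
    sizes = Counter(cluster_of_read.values())
    return {cid for cid, n in sizes.items() if n >= min_reads}


# Toy assignment: cluster 0 has 3 reads, cluster 1 has 2, cluster 2 has 1.
assignment = {"r1": 0, "r2": 0, "r3": 0, "r4": 1, "r5": 1, "r6": 2}

for threshold in (1, 2, 3):
    kept = clusters_kept(assignment, threshold)
    print(f"min_reads={threshold}: {len(kept)} cluster(s) kept -> {sorted(kept)}")
```

Raising the threshold from 1 to 3 drops the two smaller clusters, which is the same reason low-support isoforms may go uncalled with a higher iso_abundance.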

alexyfyf commented 9 months ago

Hi, sorry for the late reply, and thanks for your suggestions. What if I already have a lot of clusters? When I run isONform_parallel.py, are there any parameters that can improve speed and IO? My issue is that isONform_parallel.py generates too many temporary files, which quickly use up my inodes. I would like some suggestions on how to (1) reduce the number of tmp files generated and (2) speed up isONform_parallel.py.
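In case it helps narrow down where the files pile up, below is a small sketch that counts directory entries under each top-level subfolder of an output directory. It assumes that every file or directory costs roughly one inode (hard links aside) and makes no assumption about how isONform lays out its output; the function name is made up for illustration.

```python
import os
from collections import Counter


def count_entries_per_subdir(root):
    """Count files and directories found below each immediate child of `root`.

    Entries that sit directly in `root` (including the child directories
    themselves) are counted under the key ".".
    """
    counts = Counter()
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        top = rel.split(os.sep)[0]  # "." when dirpath is root itself
        counts[top] += len(dirnames) + len(filenames)
    return counts


# Example use (the path is a placeholder):
# for name, n in count_entries_per_subdir("path/to/outfolder").most_common(5):
#     print(n, name)
```

Running this while the job is in progress should show which subfolder is growing fastest.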

Cheers, Alex