HaploKit / Strainline

Full-length de novo viral haplotype reconstruction from noisy long reads
GNU General Public License v3.0

memory issue #8

Open antoine4ucsd opened 2 years ago

antoine4ucsd commented 2 years ago

First, congrats on your publication. This was much needed. I am trying to apply it to nanopore SIV data. I was able to run it for several samples on our Linux servers with the following command:

./src/strainline.sh -i ./in/mydata.fa -o ./out/ -k 100 --maxGD 0.005 --maxLD 0.001 --minOvlpLen 1000 --minSeedLen 2000 --minAbun 0.01 --maxOH 20 -p ont --minIdentity 0.995   -t 24

But for the majority, I have a memory error. I am attaching the log. Any suggestions to optimize? Would you recommend different settings? My goal is to recover SIV FL genome (~9kb) haplotypes with their relative frequencies.

thank you! strainline.log.txt

HaploKit commented 2 years ago

Thanks for your interest.

From the log file, it does not seem to be a memory error, and in my experience your data size (100-400 MB FASTA, 9 kb genome) should be acceptable to Strainline. The parameter settings look fine, so I am not sure why it did not work. I assume you can run Strainline on the example data successfully? If yes, you could try reformatting your input FASTA file to see if that helps, for example: (1) using wrapped FASTA:

>Read1
ATCTTTTAAAATTT
TTTACCCGGGGGG
TTTAAACCC

(2) removing spaces in FASTA headers, like this: >Read1 instead of >Read1 OtherInformation
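Both suggestions can be applied with standard tools; a minimal sketch (the toy input below is an assumption, not real data):

```shell
# Toy example input (hypothetical): a header with extra information after
# a space, and one long unwrapped sequence line.
printf '>Read1 OtherInformation\n%s\n' \
  "$(printf 'ACGT%.0s' 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20)" > mydata.fa

# (1) wrap sequence lines to 60 characters and
# (2) keep only the first whitespace-separated field of each header.
awk '/^>/ {print $1; next}
     {for (i = 1; i <= length($0); i += 60) print substr($0, i, 60)}' \
    mydata.fa > mydata.clean.fa
```

The same awk one-liner works unchanged on a real input file; only the file names would differ.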

Please let me know if it helps or not. Thanks.

antoine4ucsd commented 2 years ago

Good suggestion, I will try that. I was able to make it work after reducing -k to 50; trying -k 80 now. Any thoughts on that?

thank you

HaploKit commented 2 years ago

Good to hear that. I have no clue yet why the larger k fails. You could try different k values and pick one that works; in general, the final results are robust to k as long as it does not vary too much.

antoine4ucsd commented 2 years ago

Seems to work with k=80 across all samples (so far), thank you. Do you have any other suggestions or comments regarding this specific project (SIV full-genome sequencing to obtain representative haplotypes)? I am storing depth, frequency, and length in my code, but I would like to output good metrics for quality evaluation, and I can't find them in the 'standard' output. Do you have additional code to save/store/plot these metrics? Any red flags for discarding or keeping haplotypes? Sorry for all the questions, and happy to continue the discussion by private message.

Thank you!

HaploKit commented 2 years ago

No problem. The evaluation code is deposited in the evaluation/ folder. You could refer to it, but note that it only works when a gold standard is provided. Not sure if this is what you want.

You could tune parameters such as --maxGD, --maxLD, --minAbun, and --minIdentity to discard or keep haplotypes. I do not know the read length of your data, but if the reads are too short and you find the output haplotypes are clearly shorter than the true haplotypes, you could try increasing the --iter parameter (e.g. 3 or 4; the default is 2).

antoine4ucsd commented 2 years ago

thank you

antoine4ucsd commented 2 years ago

Still having memory issues for some samples... weird. On my server, I used:

#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=48
#SBATCH --mem=96G

For HIV/SIV, I was considering the following parameters. May I ask which parameters you used in your HIV benchmarks?

-k 100 --maxGD 0.005 --maxLD 0.001 --minOvlpLen 1000 --minSeedLen 2000 --minAbun 0.01 --maxOH 20 -p ont --minIdentity 0.995  -t 48

based on your suggestions, I will add

--iter 4

All suggestions for optimizing our settings are very welcome. I can also share a couple of samples if you can/want to look into them.
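For reference, putting the pieces in this thread together gives a minimal SLURM submission script along these lines (a sketch: k=80 and --iter 4 follow the earlier comments, all paths are assumptions, and -t is matched to --cpus-per-task so Strainline can use every allocated core):

```shell
#!/bin/bash
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=48
#SBATCH --mem=96G

# -t matches --cpus-per-task above so all allocated cores are used.
./src/strainline.sh -i ./in/mydata.fa -o ./out/ \
    -k 80 --maxGD 0.005 --maxLD 0.001 --minOvlpLen 1000 --minSeedLen 2000 \
    --minAbun 0.01 --maxOH 20 -p ont --minIdentity 0.995 --iter 4 -t 48
```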

thank you

antoine4ucsd commented 2 years ago

(I also noticed some odd haplotypes with the above parameters, with lengths of 15-20 kb when the full-length genome is expected to be ~9 kb. Can this be prevented upfront by adjusting the parameters? Thank you.)
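Overlong records can also be filtered after the fact. A minimal awk sketch, assuming the final haplotypes sit in some FASTA file (the toy input, the file names, and the length cutoff below are all assumptions; something like 12000 would suit a ~9 kb genome):

```shell
# Toy input (hypothetical): one expected-length and one overlong record.
printf '>hap1 freq=0.6\n%s\n>hap2 freq=0.1\n%s\n' \
  "$(printf 'A%.0s' 1 2 3 4 5)" "$(printf 'C%.0s' 1 2 3 4 5 6 7 8 9 10)" > haps.fa

# Drop records whose sequence exceeds maxlen (8 here, to fit the toy data).
awk -v maxlen=8 '
  /^>/ { if (header != "" && length(seq) <= maxlen) print header "\n" seq
         header = $0; seq = ""; next }
       { seq = seq $0 }
  END  { if (header != "" && length(seq) <= maxlen) print header "\n" seq }
' haps.fa > haps.filtered.fa
```

The script accumulates each record's sequence across wrapped lines, so it works on both wrapped and single-line FASTA.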