esolares / HapSolo

Reduction of Althaps and Duplicate Contigs for Improved Hi-C Scaffolding
GNU General Public License v2.0
19 stars 6 forks source link

Error: HapSolo has two seperate set of contigs! #9

Open tdkaliki opened 2 years ago

tdkaliki commented 2 years ago

Hello!

I try to run HapSolo but I get the same error all the time with different minimap2 outputs.

Error: HapSolo has two seperate set of contigs! Please submit bug report and sent bugreport.log file.

Here is a bug report for one of the runs.

bugreport.log

yschoeneberg commented 1 year ago

Hi all, I ran into the same issue, and I think it is caused by a bug in line 789 of the script hapsolo.py. I think there is a len() missing to get an integer from a set substraction. Hence, this line should say: if len(myGoodContigsSet - set(myContigsDict.keys())) != 0: Best, Yannis

esolares commented 1 year ago

Hi,

Thank you Yannis! very good catch. I also found that sometimes an empty string in the goodcontigset. I hard coded the removal of it, but will work on a better implementation for check of empty strings in the goodcontigset that was also causing an unexpected exit in the program.

spoonbender76 commented 1 year ago

Hi,

I tried minimap2 outputs but still ran into the same issue even with the latest HapSolo.py if len(myGoodContigsSet - set(myContigsDict.keys())) != 0:

Error: HapSolo has two seperate set of contigs! Please submit bug report and sent bugreport.log file.

Here is a bug report for one of the runs.

bugreport.log

esolares commented 1 year ago

Hi @spoonbender76 Could you please pull the latest commit and try again? I noticed I accidentally removed the dictionary initialization in the last commit. I also added check to make sure the contigs dictionary generated from the fasta file was not empty. Please let me know if more issues arise. I want to make sure we can resolve this bug.

spoonbender76 commented 1 year ago

Hi,

I pulled the latest commit and noticed some weird behaviours.

I run

conda activate hapsolo
cd ~/ssd/hapsolo-test
~/ssd/HapSolo/hapsolo.py -i p_ctg_new.fa -a p_ctg_new.fa_self_align_1.paf -b ./contigs/busco/ -t 10 -n 5000

conda env is the Python 2.7 with PANDAS. p_ctg_new.fa is generated by preprocessfasta.py -i p_ctg.fa and it's a 1.2GB fasta file before the run. p_ctg_new.fa_self_align_1.paf is minimap2 output from the first set of options.

The programs finished with

Thread: 1 Iteration: 4996 PID: 0.810276774047 QPctMin: 0.728386435669 QRPctMin: 0.625369302542 CostFxnDelta: 0.0 ASMScoreFxn: 2.3683127572
Thread: 1 Iteration: 4997 PID: 0.810376774047 QPctMin: 0.728386435669 QRPctMin: 0.625569302542 CostFxnDelta: 0.0 ASMScoreFxn: 2.3683127572
Thread: 1 Iteration: 4998 PID: 0.810476774047 QPctMin: 0.728386435669 QRPctMin: 0.625769302542 CostFxnDelta: 0.0 ASMScoreFxn: 2.3683127572
Thread: 1 Iteration: 4999 PID: 0.810476774047 QPctMin: 0.728586435669 QRPctMin: 0.625869302542 CostFxnDelta: 0.0 ASMScoreFxn: 2.3683127572
Writing p_ctg_new.fa_1000_0.7867_1.0001to0.9999_0.5066_primary.fasta with score: 2.3613963039

I check the folder ~/ssd/hapsolo-test/asms. There are two files

p_ctg_new.fa_1000_0.7867_1.0001to0.9999_0.5066_primary.fasta
p_ctg_new.fa_1000_0.7867_1.0001to0.9999_0.5066_secondary.fasta

The primary one is exactly same as p_ctg.fa and the secondary is empty.

And p_ctg_new.fa after the run somehow turned into 248.9KB text file. File name didn't change but the content is something unknown. Since github doesn’t support .fa file type so I upload it with .txt suffix added.

2.3683127572,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.00691645330021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.

p_ctg_new.fa.txt

I also tried rename p_ctg_new.fa to p_ctg_new_test.fa and rerun, tried rerun with p_ctg_new.fa_self_align_2/3.paf but the behaviours are same.

esolares commented 1 year ago

Hi. Yes this was very odd indeed. I have gone ahead and updated and pushed a fix. I tested this on the mosquito data and achieved 200MB genome with 616 & 653 contigs on 1k and 5k iterations, and is pretty close to what is on the publication.

Please take a look again. There were some issues where I was not using the length which resulted in unexpected behavior's. Thank you again for bringing this to my attention. I hope we are one step closer to resolving these issues. Please let me know what you get now when you have a chance. I have also updated the singularity image in case you are using that instead.

esolares commented 1 year ago

Also that's very strange that your fasta file got overwritten. hapsolo.py does not overwrite fasta files... I'm not sure what happened there. also you might want to just do ln -s p_ctg.fa p_ctg.fasta just in case.

spoonbender76 commented 1 year ago

Hi,

Thank you for the quick fix. I pulled the latest commit and tried again.

I found out that the overwrite thing is caused by the shared filename between p_ctg_new.fa and p_ctg_new.fa_self_align_1.paf.

After I renamed p_ctg_new.fa to p_ctg_hapsolo_test.fa and rerun

~/ssd/HapSolo/hapsolo.py -i p_ctg_hapsolo_test.fasta -a p_ctg_new.fa_self_align_1.paf -b ./contigs/busco/ -t 10 -n 5000 
~/ssd/HapSolo/hapsolo.py -i p_ctg_hapsolo_test.fasta -a p_ctg_new.fa_self_align_2.paf -b ./contigs/busco/ -t 10 -n 5000 
~/ssd/HapSolo/hapsolo.py -i p_ctg_hapsolo_test.fasta -a p_ctg_new.fa_self_align_3.paf -b ./contigs/busco/ -t 10 -n 5000

HapSolo finished without throwing any error. In workdir .scores .deltascoresfiles were generated. p_ctg_hapsolo_test.fa didn't got overwritten. The issue of Error: HapSolo has two seperate set of contigs is fixed.

However, I checked results in asms folder and results are not ideal. From the first set of minimap2 options, the primary one is exactly same as p_ctg.fa and the secondary is empty. The second and the third set of minimap2 options produce primary over 1G which are much larger than purge_dups/purge_haplotigs/FCM result (500~600M). This is another problem though.

esolares commented 1 year ago

Hi.

I see. I will take a look at why it would be over writing the file again. What minimap2 parameters are you using? It could be with how the alignment is being done. Is the genome of the sample highly homozygous? I noticed this when I was running it on P. leucopus. I am working on version 2 with students where this should work better for more homozygous samples.

Thank you again,

Edwin

On Thu, Nov 9, 2023, 10:34 PM spoonbender76 @.***> wrote:

Hi,

Thank you for the quick fix. I pulled the latest commit and tried again.

I found out that the overwrite thing is caused by the shared filename between p_ctg_new.fa and p_ctg_new.fa_self_align_1.paf.

After I renamed p_ctg_new.fa to p_ctg_hapsolo_test.fa and rerun

~/ssd/HapSolo/hapsolo.py -i p_ctg_hapsolo_test.fasta -a p_ctg_new.fa_self_align_1.paf -b ./contigs/busco/ -t 10 -n 5000 ~/ssd/HapSolo/hapsolo.py -i p_ctg_hapsolo_test.fasta -a p_ctg_new.fa_self_align_2.paf -b ./contigs/busco/ -t 10 -n 5000 ~/ssd/HapSolo/hapsolo.py -i p_ctg_hapsolo_test.fasta -a p_ctg_new.fa_self_align_3.paf -b ./contigs/busco/ -t 10 -n 5000

HapSolo finished without throwing any error. In workdir .scores .deltascoresfiles were generated. p_ctg_hapsolo_test.fa didn't got overwritten. The issue of Error: HapSolo has two seperate set of contigs is fixed.

However, I checked results in asms folder and results are not ideal. From the first set of minimap2 options, the primary one is exactly same as p_ctg.fa and the secondary is empty. The second and the third set of minimap2 options produce primary over 1G which are much larger than purge_dups/purge_haplotigs/FCM result (500~600M). This is another problem though.

— Reply to this email directly, view it on GitHub https://urldefense.com/v3/__https://github.com/esolares/HapSolo/issues/9*issuecomment-1805180638__;Iw!!CzAuKJ42GuquVTTmVmPViYEvSg!N5JWzyC5zAsFw6BDDMtAxbjTK7jPQWy2mUVLhPCig0wcC-dlMtPO4rz3-vdTzrKdK35sotjlGvtiu8LEci3us4Y$, or unsubscribe https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/ABBDVPWPV6ADSTJZ7OJSGZTYDXDHJAVCNFSM6AAAAAAR76UEXOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMBVGE4DANRTHA__;!!CzAuKJ42GuquVTTmVmPViYEvSg!N5JWzyC5zAsFw6BDDMtAxbjTK7jPQWy2mUVLhPCig0wcC-dlMtPO4rz3-vdTzrKdK35sotjlGvtiu8LEvjEGLVY$ . You are receiving this because you commented.Message ID: @.***>

esolares commented 11 months ago

Hi, I have not heard back. I have also reduced the special character error to a warning log instead of a quit on true. I will close this issue out in one week if I do not hear back from you. Hopefully you were able to run HapSolo fine now.

spoonbender76 commented 11 months ago

Hi Edwin,

I apologize for the delay in getting back to you. Time got away from me, and I appreciate your patience.

For minimap2 parameters, I tried three sets you provide in README.MD

minimap2 -t 36 -k19 -w5 -A1 -B2 -O3,13 -E2,1 -s200 -z200 -N50 --min-occ-floor=100 ${QRY} ${QRY} > $(basename ${QRY} .fasta)_self_align.paf
minimap2 -t 36 -P -k19 -w2 -A1 -B2 -O1,6 -E2,1 -s200 -z200 -N50 --min-occ-floor=100 ${QRY} ${QRY} > $(basename ${QRY} .fasta)_self_align.paf
minimap2 -t 36 -P -G 500k -k19 -w2 -A1 -B2 -O2,4 -E2,1 -s200 -z200 -N50 --max-qlen 10000000 --min-occ-floor=100 --paf-no-hit ${QRY} ${QRY} > $(basename ${QRY} .fasta)_self_align.paf

I also tried blat/pblat however the process was too slow and in the end it always got killed due to out of memory (512 GB).

For my sample, It's a small insect collected from the wild. One male and one female were isolated and mated to produce F1 progeny. A single pair was then selected for sibling inbreeding for 10 generations. Given that my sample is very small, I gathered individuals from F10 (50 for short-read sequencing, 200 for hifi sequencing). KMC+GenomeScope2 shows high heterozygosity and weird model.

transformed linear plot for short-read sequencing: kmc_k17_gs2_survey_transformed_linear_plot

transformed linear plot for hifi sequencing: kmc_k21_gs2_hifi_transformed_linear_plot

Histograms from purge_haplotigs/purge_dups are strange too: 2311042008-aligned bam histogram

According to the developer of purge_haplotigs, that smaller peak at 5 is not the duplicated bit but something else. He advised me to use -l 15 -m 80 -h 200 and -a 50 and I got BUSCO score like C:97.6%[S:93.3%,D:4.3%],F:0.9%,M:1.5%. Purge_dups results are similar with custom -l 15 -m 80 -h 200 cutoffs. I also tried some other kmer-based pipeline like khaper/KmerDedup and feed the result to purge_haplotigs/purge_dups. That smaller peak at 5 was still there. I had to use custom cutoffs to remove it and got worse BUSCO results than previous runs.

All these methods are not ideal because in Juicebox manual curation phase, there are too many small duplicated/heterozygous areas I have to remove one by one, which should be automatically removed from purge duplication phase earlier.

2023

These are examples of large duplicated/heterozygous areas and there are many more small areas. After I manually removed these parts, the BUSCO D score went down from 4.3% to 1.7%. I wish they could be automatically resolved before Hi-C scaffolding, but I have not found a solution yet.

Thank you for reading all this and continued support. If there's any further information you need from me, please let me know.

Best regards,

Hank