RolandFaure / Hairsplitter

Software that separates very close sequences that have been collapsed during assembly. Uses only long reads.
GNU General Public License v3.0
26 stars 0 forks source link

ERROR: create_new_contigs failed. #5

Closed FrostFlow13 closed 1 year ago

FrostFlow13 commented 1 year ago

Sorry for already having another bug to report! I was trying to run Hairsplitter today after the new update (one with the multiploid command, one without, both using multithreading). Hairsplitter ran so much faster this time around, and none of the previously problematic steps seemed to have issues!

However, there seems to be a new issue with STAGE 6.


Running (this one was without multiploid)

#!/bin/bash
#SBATCH --time=08:00:00
#SBATCH --nodes=1 --ntasks=1 --cpus-per-task=28
#SBATCH --account=PAS1802
#SBATCH --job-name=1376_hairsplitter-25kb-keephap-2
#SBATCH --export=ALL
#SBATCH --output=1376_hairsplitter-25kb-keephap-2.out.%j
module load cmake/3.25.2
module load gnu/11.2.0
source /users/PAS1802/woodruff207/miniconda3/bin/activate
conda activate hairsplitter_env
cd /fs/ess/PAS1802/ALW/2023_06_15-MAY1376_TLOKOs_LongRead/1376/2_flye_assembly-keephap-25kb-2/
python /users/PAS1802/woodruff207/Hairsplitter/hairsplitter.py -f ../1_demul_adtrim/BC15-25kbmin.fastq -i assembly_graph.gfa -x ont -o ../8_hairsplitter -t 28

Resulted in

 - Loading all reads from ../1_demul_adtrim/BC15-25kbmin.fastq in memory
 - Loading all contigs from ../8_hairsplitter/tmp/cut_assembly.gfa in memory
 - Loading alignments of the reads on the contigs from ../8_hairsplitter/tmp/reads_on_asm.sam
 - Calling variants on each contig using basic pileup
separating reads on contig CONTIG   edge_34@3   46849   117.353
separating reads on contig CONTIG   edge_14@0   8141    131.349
separating reads on contig CONTIG   edge_40@3   11455   100.049
separating reads on contig CONTIG   edge_12@0   1480    444.816
separating reads on contig CONTIG   edge_23@0   18874   219.034
separating reads on contig CONTIG   edge_15@0   21915   217.477
separating reads on contig CONTIG   edge_45@4   300000  223.291
separating reads on contig CONTIG   edge_1@0    72803   76.6896
separating reads on contig CONTIG   edge_36@0   10083   360.579
separating reads on contig CONTIG   edge_37@1   64272   233.718
separating reads on contig CONTIG   edge_46@0   300000  211.382
separating reads on contig CONTIG   edge_28@6   87656   145.68
separating reads on contig CONTIG   edge_39@0   115223  209.538
separating reads on contig CONTIG   edge_41@1   233709  216.108
separating reads on contig CONTIG   edge_16@1   93636   209.789
separating reads on contig CONTIG   edge_45@0   300000  195.72
separating reads on contig CONTIG   edge_28@5   300000  225.631
separating reads on contig CONTIG   edge_44@4   300000  221.944
separating reads on contig CONTIG   edge_45@2   300000  218.656
separating reads on contig CONTIG   edge_37@0   300000  214.827
separating reads on contig CONTIG   edge_44@1   300000  220.502
separating reads on contig CONTIG   edge_44@0   300000  231.295
separating reads on contig CONTIG   edge_28@4   300000  209.897
separating reads on contig CONTIG   edge_6@1    300000  210.705
separating reads on contig CONTIG   edge_48@0   300000  228.833
separating reads on contig CONTIG   edge_45@1   300000  211.574
separating reads on contig CONTIG   edge_22@0   2532    458.313
separating reads on contig CONTIG   edge_40@1   300000  223.89
separating reads on contig CONTIG   edge_44@3   300000  215.01
separating reads on contig CONTIG   edge_6@0    300000  215.125
separating reads on contig CONTIG   edge_34@1   300000  212.838
separating reads on contig CONTIG   edge_48@1   300000  243.961
separating reads on contig CONTIG   edge_33@0   10239   214.442
separating reads on contig CONTIG   edge_38@0   300000  209.204
separating reads on contig CONTIG   edge_44@2   300000  221.737
separating reads on contig CONTIG   edge_34@2   300000  200.835
separating reads on contig CONTIG   edge_35@0   300000  227.907
separating reads on contig CONTIG   edge_34@0   300000  198.119
separating reads on contig CONTIG   edge_45@5   299225  193.126
separating reads on contig CONTIG   edge_3@0    25469   220.44
separating reads on contig CONTIG   edge_44@6   300000  224.849
separating reads on contig CONTIG   edge_28@2   300000  218.004
separating reads on contig CONTIG   edge_44@9   99227   148.113
separating reads on contig CONTIG   edge_4@0    16120   53.9382
separating reads on contig CONTIG   edge_42@0   1252    12304.8
separating reads on contig CONTIG   edge_7@2    233248  230.62
separating reads on contig CONTIG   edge_44@5   300000  223.86
separating reads on contig CONTIG   edge_28@3   300000  226.373
separating reads on contig CONTIG   edge_47@0   262506  233.746
separating reads on contig CONTIG   edge_40@0   300000  203.257
separating reads on contig CONTIG   edge_45@3   300000  226.299
separating reads on contig CONTIG   edge_35@1   300000  226.719
separating reads on contig CONTIG   edge_41@0   300000  203.041
separating reads on contig CONTIG   edge_28@1   300000  211.533
separating reads on contig CONTIG   edge_40@2   300000  204.561
separating reads on contig CONTIG   edge_44@7   300000  228.214
separating reads on contig CONTIG   edge_32@0   159093  131.672
separating reads on contig CONTIG   edge_7@1    300000  263.177
separating reads on contig CONTIG   edge_44@8   300000  226.218
separating reads on contig CONTIG   edge_28@0   300000  232.642
separating reads on contig CONTIG   edge_16@0   300000  204.921
separating reads on contig CONTIG   edge_38@1   97516   218.851
separating reads on contig CONTIG   edge_6@2    260846  197.358
separating reads on contig CONTIG   edge_48@3   85724   104.09
separating reads on contig CONTIG   edge_48@2   300000  239.204
separating reads on contig CONTIG   edge_46@1   80433   197.154
separating reads on contig CONTIG   edge_7@0    300000  203.408
 - Creating the .gaf file describing how the reads align on the new contigs
 - Creating the new contigs
ERROR racon failed, while running racon -w 500 -e 1 -t 1 ../8_hairsplitter/tmp/reads_11.fasta ../8_hairsplitter/tmp/mapped_11.paf ../8_hairsplitter/tmp/unpolished_11.fasta > ../8_hairsplitter/tmp/polished_11.fasta 2>../8_hairsplitter/tmp/trash.txt
/users/PAS1802/woodruff207/Hairsplitter/hairsplitter.py -f ../1_demul_adtrim/BC15-25kbmin.fastq -i assembly_graph.gfa -x ont -o ../8_hairsplitter -t 28
HairSplitter v1.3.3 (github.com/RolandFaure/HairSplitter). Last update: 2023-08-21

    ******************
    *                *
    *  Hairsplitter  *
    *    Welcome!    *
    *                *
    ******************

===== STAGE 1: Cleaning graph of small contigs that are unconnected parts of haplotypes   [ 2023-08-21 14:44:48.560662 ]

 When the assemblers manage to locally phase the haplotypes, they sometimes assemble the alternative haplotype as a separate contig, unconnected in the gfa graph. This affects negatively the performance of Hairsplitter. Let's delete these contigs

 - Mapping the assembly against itself
 Running:  /users/PAS1802/woodruff207/Hairsplitter/src/build/clean_graph assembly_graph.gfa ../8_hairsplitter/tmp/cleaned_assembly.gfa ../8_hairsplitter ../8_hairsplitter/hairsplitter.log 28 minimap2
 - Eliminated small unconnected contigs that align on other contigs

===== STAGE 2: Aligning reads on the reference   [ 2023-08-21 14:44:50.479979 ]

 - Cutting the contigs in chunks of 300000bp to avoid memory issues
 - Converting the assembly in fasta format
 - Aligning the reads on the assembly
 - Running minimap with command line:
      minimap2 ../8_hairsplitter/tmp/cleaned_assembly.fasta ../1_demul_adtrim/BC15-25kbmin.fastq -x map-ont -a --secondary=no -t 28 > ../8_hairsplitter/tmp/reads_on_asm.sam 2> ../8_hairsplitter/tmp/logminimap.txt 
   The log of minimap2 can be found at ../8_hairsplitter/tmp/logminimap.txt

===== STAGE 3: Calling variants   [ 2023-08-21 14:46:07.495951 ]

 Running:  /users/PAS1802/woodruff207/Hairsplitter/src/build/call_variants ../8_hairsplitter/tmp/cut_assembly.gfa ../1_demul_adtrim/BC15-25kbmin.fastq ../8_hairsplitter/tmp/reads_on_asm.sam 28 ../8_hairsplitter/tmp ../8_hairsplitter/tmp/error_rate.txt 0 ../8_hairsplitter/tmp/variants.col ../8_hairsplitter/tmp/variants.vcf

===== STAGE 4: Filtering variants   [ 2023-08-21 14:51:21.968869 ]

 - Filtering variants
 Running:  /users/PAS1802/woodruff207/Hairsplitter/src/build/filter_variants ../8_hairsplitter/tmp/variants.col 0.0121198 28 0 ../8_hairsplitter/tmp/filtered_variants.col ../8_hairsplitter/tmp/variants.vcf ../8_hairsplitter/tmp/variants_filtered.vcf

===== STAGE 5: Separating reads by haplotype of origin   [ 2023-08-21 14:51:51.986796 ]

 - Separating reads by haplotype of origin
 Running:  /users/PAS1802/woodruff207/Hairsplitter/src/build/separate_reads ../8_hairsplitter/tmp/filtered_variants.col 28 0.0121198 0 ../8_hairsplitter/tmp/reads_haplo.gro

===== STAGE 6: Creating all the new contigs   [ 2023-08-21 16:22:00.649424 ]

 This can take time, as we need to polish every new contig using Racon
 Running :  /users/PAS1802/woodruff207/Hairsplitter/src/build/create_new_contigs ../8_hairsplitter/tmp/cut_assembly.gfa ../1_demul_adtrim/BC15-25kbmin.fastq 0.0121198 ../8_hairsplitter/tmp/reads_haplo.gro ../8_hairsplitter/tmp 28 ont ../8_hairsplitter/tmp/zipped_assembly.gfa ../8_hairsplitter/tmp/reads_on_new_contig.gaf 0 minimap2 racon 0
ERROR: create_new_contigs failed. Was trying to run: /users/PAS1802/woodruff207/Hairsplitter/src/build/create_new_contigs ../8_hairsplitter/tmp/cut_assembly.gfa ../1_demul_adtrim/BC15-25kbmin.fastq 0.0121198 ../8_hairsplitter/tmp/reads_haplo.gro ../8_hairsplitter/tmp 28 ont ../8_hairsplitter/tmp/zipped_assembly.gfa ../8_hairsplitter/tmp/reads_on_new_contig.gaf 0 minimap2 racon 0

And running (this one was with multiploid)

#!/bin/bash
#SBATCH --time=08:00:00
#SBATCH --nodes=1 --ntasks=1 --cpus-per-task=28
#SBATCH --account=PAS1802
#SBATCH --job-name=1376_hairsplitter-25kb-keephap-2-multiploid
#SBATCH --export=ALL
#SBATCH --output=1376_hairsplitter-25kb-keephap-2-multiploid.out.%j
module load cmake/3.25.2
module load gnu/11.2.0
source /users/PAS1802/woodruff207/miniconda3/bin/activate
conda activate hairsplitter_env
cd /fs/ess/PAS1802/ALW/2023_06_15-MAY1376_TLOKOs_LongRead/1376/2_flye_assembly-keephap-25kb-2/
python /users/PAS1802/woodruff207/Hairsplitter/hairsplitter.py -f ../1_demul_adtrim/BC15-25kbmin.fastq -i assembly_graph.gfa -x ont -o ../8_hairsplitter-multiploid -m -t 28

Resulted in

 - Loading all reads from ../1_demul_adtrim/BC15-25kbmin.fastq in memory
 - Loading all contigs from ../8_hairsplitter-multiploid/tmp/cut_assembly.gfa in memory
 - Loading alignments of the reads on the contigs from ../8_hairsplitter-multiploid/tmp/reads_on_asm.sam
 - Calling variants on each contig using basic pileup
separating reads on contig CONTIG   edge_22@0   2532    458.313
separating reads on contig CONTIG   edge_15@0   21915   217.477
separating reads on contig CONTIG   edge_36@0   10083   360.579
separating reads on contig CONTIG   edge_1@0    72803   76.6896
separating reads on contig CONTIG   edge_48@3   85724   104.09
separating reads on contig CONTIG   edge_44@9   99227   148.113
separating reads on contig CONTIG   edge_48@2   300000  239.204
separating reads on contig CONTIG   edge_28@6   87656   145.68
separating reads on contig CONTIG   edge_41@1   233709  216.108
separating reads on contig CONTIG   edge_28@1   300000  211.533
separating reads on contig CONTIG   edge_28@5   300000  225.631
separating reads on contig CONTIG   edge_45@2   300000  218.656
separating reads on contig CONTIG   edge_47@0   262506  233.746
separating reads on contig CONTIG   edge_44@1   300000  220.502
separating reads on contig CONTIG   edge_48@0   300000  228.833
separating reads on contig CONTIG   edge_41@0   300000  203.041
separating reads on contig CONTIG   edge_44@8   300000  226.218
separating reads on contig CONTIG   edge_44@0   300000  231.295
separating reads on contig CONTIG   edge_28@4   300000  209.897
separating reads on contig CONTIG   edge_28@0   300000  232.642
separating reads on contig CONTIG   edge_35@1   300000  226.719
separating reads on contig CONTIG   edge_45@1   300000  211.574
separating reads on contig CONTIG   edge_7@2    233248  230.62
separating reads on contig CONTIG   edge_40@1   300000  223.89
separating reads on contig CONTIG   edge_28@2   300000  218.004
separating reads on contig CONTIG   edge_34@1   300000  212.838
separating reads on contig CONTIG   edge_32@0   159093  131.672
separating reads on contig CONTIG   edge_6@0    300000  215.125
separating reads on contig CONTIG   edge_38@1   97516   218.851
separating reads on contig CONTIG   edge_35@0   300000  227.907
separating reads on contig CONTIG   edge_40@3   11455   100.049
separating reads on contig CONTIG   edge_46@0   300000  211.382
separating reads on contig CONTIG   edge_42@0   1252    12304.8
separating reads on contig CONTIG   edge_45@5   299225  193.126
separating reads on contig CONTIG   edge_44@3   300000  215.01
separating reads on contig CONTIG   edge_45@4   300000  223.291
separating reads on contig CONTIG   edge_34@3   46849   117.353
separating reads on contig CONTIG   edge_40@2   300000  204.561
separating reads on contig CONTIG   edge_40@0   300000  203.257
separating reads on contig CONTIG   edge_34@0   300000  198.119
separating reads on contig CONTIG   edge_37@1   64272   233.718
separating reads on contig CONTIG   edge_45@0   300000  195.72
separating reads on contig CONTIG   edge_14@0   8141    131.349
separating reads on contig CONTIG   edge_28@3   300000  226.373
separating reads on contig CONTIG   edge_23@0   18874   219.034
separating reads on contig CONTIG   edge_44@6   300000  224.849
separating reads on contig CONTIG   edge_33@0   10239   214.442
separating reads on contig CONTIG   edge_6@2    260846  197.358
separating reads on contig CONTIG   edge_12@0   1480    444.816
separating reads on contig CONTIG   edge_37@0   300000  214.827
separating reads on contig CONTIG   edge_16@1   93636   209.789
separating reads on contig CONTIG   edge_6@1    300000  210.705
separating reads on contig CONTIG   edge_46@1   80433   197.154
separating reads on contig CONTIG   edge_4@0    16120   53.9382
separating reads on contig CONTIG   edge_44@4   300000  221.944
separating reads on contig CONTIG   edge_34@2   300000  200.835
separating reads on contig CONTIG   edge_44@2   300000  221.737
separating reads on contig CONTIG   edge_16@0   300000  204.921
separating reads on contig CONTIG   edge_48@1   300000  243.961
separating reads on contig CONTIG   edge_38@0   300000  209.204
separating reads on contig CONTIG   edge_44@7   300000  228.214
separating reads on contig CONTIG   edge_39@0   115223  209.538
separating reads on contig CONTIG   edge_7@0    300000  203.408
separating reads on contig CONTIG   edge_7@1    300000  263.177
separating reads on contig CONTIG   edge_45@3   300000  226.299
separating reads on contig CONTIG   edge_44@5   300000  223.86
separating reads on contig CONTIG   edge_3@0    25469   220.44
 - Creating the .gaf file describing how the reads align on the new contigs
 - Creating the new contigs
ERROR racon failed, while running racon -w 500 -e 1 -t 1 ../8_hairsplitter-multiploid/tmp/reads_11.fasta ../8_hairsplitter-multiploid/tmp/mapped_11.paf ../8_hairsplitter-multiploid/tmp/unpolished_11.fasta > ../8_hairsplitter-multiploid/tmp/polished_11.fasta 2>../8_hairsplitter-multiploid/tmp/trash.txt
/users/PAS1802/woodruff207/Hairsplitter/hairsplitter.py -f ../1_demul_adtrim/BC15-25kbmin.fastq -i assembly_graph.gfa -x ont -o ../8_hairsplitter-multiploid -m -t 28
HairSplitter v1.3.3 (github.com/RolandFaure/HairSplitter). Last update: 2023-08-21

    ******************
    *                *
    *  Hairsplitter  *
    *    Welcome!    *
    *                *
    ******************

===== STAGE 1: Cleaning graph of small contigs that are unconnected parts of haplotypes   [ 2023-08-21 14:44:56.992370 ]

 When the assemblers manage to locally phase the haplotypes, they sometimes assemble the alternative haplotype as a separate contig, unconnected in the gfa graph. This affects negatively the performance of Hairsplitter. Let's delete these contigs

 - Mapping the assembly against itself
 Running:  /users/PAS1802/woodruff207/Hairsplitter/src/build/clean_graph assembly_graph.gfa ../8_hairsplitter-multiploid/tmp/cleaned_assembly.gfa ../8_hairsplitter-multiploid ../8_hairsplitter-multiploid/hairsplitter.log 28 minimap2
 - Eliminated small unconnected contigs that align on other contigs

===== STAGE 2: Aligning reads on the reference   [ 2023-08-21 14:44:58.965748 ]

 - Cutting the contigs in chunks of 300000bp to avoid memory issues
 - Converting the assembly in fasta format
 - Aligning the reads on the assembly
 - Running minimap with command line:
      minimap2 ../8_hairsplitter-multiploid/tmp/cleaned_assembly.fasta ../1_demul_adtrim/BC15-25kbmin.fastq -x map-ont -a --secondary=no -t 28 > ../8_hairsplitter-multiploid/tmp/reads_on_asm.sam 2> ../8_hairsplitter-multiploid/tmp/logminimap.txt 
   The log of minimap2 can be found at ../8_hairsplitter-multiploid/tmp/logminimap.txt

===== STAGE 3: Calling variants   [ 2023-08-21 14:46:15.601267 ]

 Running:  /users/PAS1802/woodruff207/Hairsplitter/src/build/call_variants ../8_hairsplitter-multiploid/tmp/cut_assembly.gfa ../1_demul_adtrim/BC15-25kbmin.fastq ../8_hairsplitter-multiploid/tmp/reads_on_asm.sam 28 ../8_hairsplitter-multiploid/tmp ../8_hairsplitter-multiploid/tmp/error_rate.txt 0 ../8_hairsplitter-multiploid/tmp/variants.col ../8_hairsplitter-multiploid/tmp/variants.vcf

===== STAGE 4: Filtering variants   [ 2023-08-21 14:51:39.988027 ]

 - Filtering variants
 Running:  /users/PAS1802/woodruff207/Hairsplitter/src/build/filter_variants ../8_hairsplitter-multiploid/tmp/variants.col 0.0121198 28 0 ../8_hairsplitter-multiploid/tmp/filtered_variants.col ../8_hairsplitter-multiploid/tmp/variants.vcf ../8_hairsplitter-multiploid/tmp/variants_filtered.vcf

===== STAGE 5: Separating reads by haplotype of origin   [ 2023-08-21 14:52:10.334915 ]

 - Separating reads by haplotype of origin
 Running:  /users/PAS1802/woodruff207/Hairsplitter/src/build/separate_reads ../8_hairsplitter-multiploid/tmp/filtered_variants.col 28 0.0121198 0 ../8_hairsplitter-multiploid/tmp/reads_haplo.gro

===== STAGE 6: Creating all the new contigs   [ 2023-08-21 16:14:06.974230 ]

 This can take time, as we need to polish every new contig using Racon
 Running :  /users/PAS1802/woodruff207/Hairsplitter/src/build/create_new_contigs ../8_hairsplitter-multiploid/tmp/cut_assembly.gfa ../1_demul_adtrim/BC15-25kbmin.fastq 0.0121198 ../8_hairsplitter-multiploid/tmp/reads_haplo.gro ../8_hairsplitter-multiploid/tmp 28 ont ../8_hairsplitter-multiploid/tmp/zipped_assembly.gfa ../8_hairsplitter-multiploid/tmp/reads_on_new_contig.gaf 0 minimap2 racon 0
ERROR: create_new_contigs failed. Was trying to run: /users/PAS1802/woodruff207/Hairsplitter/src/build/create_new_contigs ../8_hairsplitter-multiploid/tmp/cut_assembly.gfa ../1_demul_adtrim/BC15-25kbmin.fastq 0.0121198 ../8_hairsplitter-multiploid/tmp/reads_haplo.gro ../8_hairsplitter-multiploid/tmp 28 ont ../8_hairsplitter-multiploid/tmp/zipped_assembly.gfa ../8_hairsplitter-multiploid/tmp/reads_on_new_contig.gaf 0 minimap2 racon 0

Both appear to be the same error, so I don't think it's the multiploid argument doing anything, but it's definitely not something I encountered previously, and this is the same dataset I ran last week, the only difference being multithreading (but I'm not certain it's multithreading at fault here). I did look at the commits and noticed that src/tools.cpp was changed to allow an exit() during polishing, but given that it didn't have a problem during the Minimap2 step (which was also given an exit()), I don't know why racon would end up having an issue.

An oddity I just noticed during this - the variants_filtered.vcf file never seems to have much added to it, even in my successful runs of Hairsplitter. All it seems to have is:

##fileformat=VCFv4.2
##source=call_variants
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO

Lastly, I don't know if it's helpful, but it looks like Hairsplitter tried to make a /tmp/ directory and a trash.txt file in my assembly folder (i.e. in the /fs/ess/PAS1802/ALW/2023_06_15-MAY1376_TLOKOs_LongRead/1376/2_flye_assembly-keephap-25kb-2/ directory). I don't know if it has done that every time and simply deleted it later, or if this is a new bug and it simply happened to leave the files there because Hairsplitter died before it could remove them. There are also a lot more files in the proper /8_hairsplitter/tmp/tmp/ directory than there were previously, like it wasn't deleting the files as it was running: image

RolandFaure commented 1 year ago

Hello, The bug occured when an haplotype was very divergent from the assembly: HairSplitter tried to polish the assembly with the reads, but failed to map the reads on the assembly. I added a reassembly module for the cases where the reads are too divergent. This should also improve the phasing of the large insertions/deletions/inversions. A new version has been pushed and released :-)

FrostFlow13 commented 1 year ago

Wonderful - thank you so much! I'll give it another try again today (one with multiploid and one without again) and see how it goes.