RolandFaure / Hairsplitter

Software that separates very close sequences that have been collapsed during assembly. Uses only long reads.
GNU General Public License v3.0

ERROR: GraphUnzip failed. #6

Closed - FrostFlow13 closed this issue 10 months ago

FrostFlow13 commented 1 year ago

It moved further along this time, but it looks like both my runs failed at the GraphUnzip step. I will note that I used a .gfa file output from Flye as the assembly input, BUT I had modified it to remove some problematic telomeric sequence, which might have caused an issue. The file opened up fine in Bandage, and Hairsplitter made a cleaned_assembly.gfa file just fine, so I'm not sure if that contributed or not. Just so it's up at the top, though, the zipped_assembly.gfa files look fantastic now (apart from a minor issue which I discuss near the bottom of this post).
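For anyone else hand-editing a Flye GFA before running Hairsplitter, a quick sanity check (just a rough sketch of my own, not part of Hairsplitter; the file name is a placeholder) is to confirm that every remaining L line still points at segments that exist:

# Rough sanity check for a hand-edited GFA (my own sketch, not part of Hairsplitter;
# the file name is a placeholder): every L line should still reference existing S lines.
segments = set()
links = []
with open("assembly_edited.gfa") as gfa:
    for line in gfa:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            segments.add(fields[1])                 # S <name> <sequence> [tags...]
        elif fields[0] == "L":
            links.append((fields[1], fields[3]))    # L <from> <orient> <to> <orient> <cigar>
dangling = [pair for pair in links if pair[0] not in segments or pair[1] not in segments]
print(len(segments), "segments,", len(links), "links,", len(dangling), "links to missing segments")
for a, b in dangling[:20]:
    print("dangling link:", a, b)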

Both of the runs (one multiploid, the other not) ended with:

===== STAGE 7: Untangling (~scaffolding) the new assembly graph to improve contiguity   [ 2023-08-23 14:00:51.693528 ]

 - Running GraphUnzip with command line:
      python /users/PAS1802/woodruff207/Hairsplitter/src/GraphUnzip/graphunzip.py unzip -l ../8_hairsplitter/tmp/reads_on_new_contig.gaf -g ../8_hairsplitter/tmp/zipped_assembly.gfa -o ../8_hairsplitter/hairsplitter_final_assembly.gfa --meta 2>../8_hairsplitter/tmp/logGraphUnzip.txt >../8_hairsplitter/tmp/trash.txt 
   The log of GraphUnzip is written on  ../8_hairsplitter/tmp/logGraphUnzip.txt

ERROR: GraphUnzip failed. Please check the output of GraphUnzip in ../8_hairsplitter/tmp/logGraphUnzip.txt

For the non-multiploid run

The logGraphUnzip.txt file was empty, but trash.txt had this inside:

Loading the GFA file
Loading contigs
WARNING: contig  ['edge_1@0_72804_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_1@0_72804_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_12@0_1481_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_12@0_1481_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_45@4_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_45@4_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_22@0_2533_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_22@0_2533_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_14@0_8142_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_14@0_8142_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_3@0_25470_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_3@0_25470_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_33@0_10240_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_33@0_10240_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_38@1_97517_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_38@1_97517_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_16@0_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_16@0_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_23@0_18875_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_23@0_18875_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_15@0_21916_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_15@0_21916_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_16@1_93637_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_16@1_93637_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_4@0_16121_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_4@0_16121_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_46@1_80434_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_46@1_80434_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_34@0_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_34@0_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_45@5_295292_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_45@5_295292_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_39@0_115224_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_39@0_115224_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_41@0_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_41@0_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_32@0_137974_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_32@0_137974_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_34@2_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_34@2_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_34@3_46850_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_34@3_46850_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_6@0_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_6@0_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_48@1_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_48@1_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_37@0_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_37@0_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_45@1_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_45@1_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_44@8_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_44@8_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_46@0_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_46@0_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_44@5_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_44@5_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_44@2_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_44@2_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_28@0_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_28@0_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_44@0_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_44@0_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_28@5_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_28@5_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_40@1_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_40@1_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_34@1_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_34@1_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_35@1_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_35@1_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_28@2_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_28@2_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_37@1_64273_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_37@1_64273_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_48@2_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_48@2_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_44@9_99228_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_44@9_99228_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_28@6_85555_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_28@6_85555_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_47@0_262507_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_47@0_262507_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_36@0_10084_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_36@0_10084_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_48@3_62608_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_48@3_62608_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_41@1_233710_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_41@1_233710_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_40@0_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_40@0_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_6@1_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_6@1_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_38@0_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_38@0_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_44@3_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_44@3_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_45@2_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_45@2_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_28@1_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_28@1_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_44@1_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_44@1_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_40@2_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_40@2_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_45@0_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_45@0_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_40@3_11456_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_40@3_11456_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_45@3_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_45@3_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_44@6_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_44@6_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_42@0_1253_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_42@0_1253_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_28@3_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_28@3_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_35@0_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_35@0_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_48@0_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_48@0_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_6@2_260847_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_6@2_260847_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_44@4_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_44@4_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_44@7_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_44@7_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_28@4_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_28@4_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_7@0_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_7@0_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_7@1_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_7@1_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_7@2_233249_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_7@2_233249_0']  has length = 0. This might infer in handling the coverage
WARNING:  67  contigs out of  1525  had no coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
================

Everything loaded, moving on to untangling the graph

================

*Untangling the graph using long reads*

Reading the gaf file...
Finished going through the gaf file.
here are all the links of  edge_44@9_0_0  :  [['edge_44@8_300001_0']]   [1]
Here is the gaf line :  ('aadd4b75-6dae-421f-a428-8bd1744e25f8', '<edge_44@9_0_0<edge_44@8_296000_0')
WARNING: discrepancy between what's found in the alignment files and the inputted GFA graph. Link  ['edge_44@9_0_0', 'edge_44@8_296000_0'] <<  not found in the gfa
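For what it's worth, that last warning can be cross-checked outside of GraphUnzip. The sketch below (my own, untested; the two paths are the files Hairsplitter passed to GraphUnzip in my run) lists every adjacent segment pair used by reads in the GAF that has no corresponding L line in the GFA:

import re

# My own diagnostic sketch (not part of GraphUnzip): list the adjacent segment pairs
# that reads use in the GAF but that have no corresponding L line in the GFA.
gfa_path = "../8_hairsplitter/tmp/zipped_assembly.gfa"
gaf_path = "../8_hairsplitter/tmp/reads_on_new_contig.gaf"

links = set()
with open(gfa_path) as gfa:
    for line in gfa:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "L":
            links.add(frozenset((fields[1], fields[3])))   # segment pair, orientation ignored

missing = {}
with open(gaf_path) as gaf:
    for line in gaf:
        path = line.split("\t")[5]                         # GAF column 6, e.g. <edge_A<edge_B
        segs = re.findall(r"[<>]([^<>]+)", path)           # drop the </> orientation markers
        for a, b in zip(segs, segs[1:]):
            pair = frozenset((a, b))
            if pair not in links:
                missing[pair] = missing.get(pair, 0) + 1

for pair, n_reads in sorted(missing.items(), key=lambda kv: -kv[1])[:20]:
    print(n_reads, "reads use a link absent from the GFA:", sorted(pair))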

For the multiploid run

The logGraphUnzip.txt file had the following in it (the trash.txt file looked very similar to the one above, but stopped at the "Finished going through the gaf file" line):

Traceback (most recent call last):
  File "/users/PAS1802/woodruff207/Hairsplitter/src/GraphUnzip/graphunzip.py", line 436, in <module>
    main()
  File "/users/PAS1802/woodruff207/Hairsplitter/src/GraphUnzip/graphunzip.py", line 388, in main
    segments = simple_unzip(segments, names, lrFile)
  File "/users/PAS1802/woodruff207/Hairsplitter/src/GraphUnzip/simple_unzip.py", line 253, in simple_unzip
    if best_pair_for_each_left_link[p[0]][0] < pairs[p] :
IndexError: list index out of range

Looks like it's almost there! The zipped_assembly.gfa files from both the multiploid and non-multiploid runs look excellent now - the contigs in those files are MUCH larger, AND I checked them against the field's current phased reference assembly (which has a number of issues, but is at least decent for checking phasing) - the homologous contigs in the zipped_assembly.gfa seem to be VERY well phased, as each one agrees strongly with only one of the phased references (example: edge_40@2_144000_1 matches 99.7% with Chr4A and 98.8% with Chr4B, while edge_40@2_144000_0 matches 99.6% with Chr4B and 98.7% with Chr4A). This is very, very exciting, especially because some of the phased contigs Hairsplitter is generating in the zipped_assembly.gfa file are 150,000+ bp long!

There is an issue with the zipped_assembly.gfa file, though - every single contig it generated, when visualized in Bandage, has "Depth: 0.0X". Maybe that's contributing to the issue? I'm not sure.
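If the missing depth matters: as far as I know, Bandage derives its "Depth" readout from a per-segment tag on the S lines (something like dp:f: or DP:f:, or count tags such as KC), so a quick way to see whether zipped_assembly.gfa carries any such tag at all is a sketch like the one below (my own check, not part of Hairsplitter; the tag list is an assumption):

# My own rough check (not part of Hairsplitter): do the S lines of zipped_assembly.gfa
# carry any depth-like tag at all? The tag names below are an assumption.
DEPTH_TAGS = ("dp:", "DP:", "KC:", "RC:", "FC:")

total = tagged = 0
with open("../8_hairsplitter/tmp/zipped_assembly.gfa") as gfa:
    for line in gfa:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "S":
            continue
        total += 1
        if any(field.startswith(DEPTH_TAGS) for field in fields[3:]):
            tagged += 1
print(tagged, "of", total, "segments have a depth-like tag")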

From the error logs, it looks like GraphUnzip is the only remaining issue (every file produced before the GraphUnzip step looks good).

If you need me to provide any files to try to solve this one, just let me know!

RolandFaure commented 1 year ago

Great to see it's progressing! I've found a bug in GraphUnzip, but I'm not 100% sure it was the one you stumbled upon. I have pushed the correction; tell me if it still fails. If it does, please send me the files and I'll give your data a try.

FrostFlow13 commented 1 year ago

UPDATE1

I just finished a number of runs with the newest version of Hairsplitter (one that includes your commits from today), and Hairsplitter now seems to successfully unzip portions of the genome via GraphUnzip! For example, one of the chromosomes went from ~240 nodes in the zipped assembly to ~25 nodes in the final assembly. This is very exciting, and it also looks like you may have fixed the bug in GraphUnzip, since it now seems to be working from what I saw. Below is an example of the results from one of the successful non-multiploid runs ("1" and "2" mark the same position on the chromosome, just for visualization purposes):

[image]

However, there does seem to be a degree of inconsistency in STAGE 6 now. A few non-multiploid runs failed at that step, and many multiploid runs failed there as well. All seemed to have the same error, with the following examples from two of the runs.


Non-multiploid:

#!/bin/bash
#SBATCH --time=05:00:00
#SBATCH --nodes=1 --ntasks=1 --cpus-per-task=28
#SBATCH --account=PAS1802
#SBATCH --job-name=CC16_hairsplitter-25kb-1-selfrun
#SBATCH --export=ALL
#SBATCH --output=CC16_hairsplitter-25kb-1-selfrun.out.%j
module load cmake/3.25.2
module load gnu/11.2.0
source /users/PAS1802/woodruff207/miniconda3/bin/activate
conda activate hairsplitter_env
cd /fs/ess/PAS1802/ALW/2019-tlo_KO_LongRead_Dublin/CC16/2_flye_assembly-25kb-1/
python /users/PAS1802/woodruff207/Hairsplitter/hairsplitter.py -f ../raw_reads/CC16-25kbmin.fastq -i assembly_graph.gfa -x ont -o ../8_hairsplitter-selfrun -t 28

Results in:

===== STAGE 6: Creating all the new contigs   [ 2023-08-24 13:10:32.914369 ]

 This can take time, as we need to polish every new contig using Racon
 Running :  /users/PAS1802/woodruff207/Hairsplitter/src/build/create_new_contigs ../8_hairsplitter-selfrun/tmp/cut_assembly.gfa ../raw_reads/CC16-25kbmin.fastq 0.0443459 ../8_hairsplitter-selfrun/tmp/reads_haplo.gro ../8_hairsplitter-selfrun/tmp 28 ont ../8_hairsplitter-selfrun/tmp/zipped_assembly.gfa ../8_hairsplitter-selfrun/tmp/reads_on_new_contig.gaf 0 minimap2 racon 0
ERROR: create_new_contigs failed. Was trying to run: /users/PAS1802/woodruff207/Hairsplitter/src/build/create_new_contigs ../8_hairsplitter-selfrun/tmp/cut_assembly.gfa ../raw_reads/CC16-25kbmin.fastq 0.0443459 ../8_hairsplitter-selfrun/tmp/reads_haplo.gro ../8_hairsplitter-selfrun/tmp 28 ont ../8_hairsplitter-selfrun/tmp/zipped_assembly.gfa ../8_hairsplitter-selfrun/tmp/reads_on_new_contig.gaf 0 minimap2 racon 0

Multiploid:

#!/bin/bash
#SBATCH --time=05:00:00
#SBATCH --nodes=1 --ntasks=1 --cpus-per-task=28
#SBATCH --account=PAS1802
#SBATCH --job-name=CC16_hairsplitter-25kb-1_multiploid-selfrun
#SBATCH --mail-user=woodruff.207@buckeyemail.osu.edu
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --export=ALL
#SBATCH --output=CC16_hairsplitter-25kb-1_multiploid-selfrun.out.%j
module load cmake/3.25.2
module load gnu/11.2.0
source /users/PAS1802/woodruff207/miniconda3/bin/activate
conda activate hairsplitter_env
cd /fs/ess/PAS1802/ALW/2019-tlo_KO_LongRead_Dublin/CC16/2_flye_assembly-25kb-1/
python /users/PAS1802/woodruff207/Hairsplitter/hairsplitter.py -f ../raw_reads/CC16-25kbmin.fastq -i assembly_graph.gfa -x ont -o ../8_hairsplitter-multiploid-selfrun -m -t 28

Results in:

===== STAGE 6: Creating all the new contigs   [ 2023-08-24 13:37:26.717152 ]

 This can take time, as we need to polish every new contig using Racon
 Running :  /users/PAS1802/woodruff207/Hairsplitter/src/build/create_new_contigs ../8_hairsplitter-multiploid-selfrun/tmp/cut_assembly.gfa ../raw_reads/CC16-25kbmin.fastq 0.0443459 ../8_hairsplitter-multiploid-selfrun/tmp/reads_haplo.gro ../8_hairsplitter-multiploid-selfrun/tmp 28 ont ../8_hairsplitter-multiploid-selfrun/tmp/zipped_assembly.gfa ../8_hairsplitter-multiploid-selfrun/tmp/reads_on_new_contig.gaf 0 minimap2 racon 0
ERROR: create_new_contigs failed. Was trying to run: /users/PAS1802/woodruff207/Hairsplitter/src/build/create_new_contigs ../8_hairsplitter-multiploid-selfrun/tmp/cut_assembly.gfa ../raw_reads/CC16-25kbmin.fastq 0.0443459 ../8_hairsplitter-multiploid-selfrun/tmp/reads_haplo.gro ../8_hairsplitter-multiploid-selfrun/tmp 28 ont ../8_hairsplitter-multiploid-selfrun/tmp/zipped_assembly.gfa ../8_hairsplitter-multiploid-selfrun/tmp/reads_on_new_contig.gaf 0 minimap2 racon 0

It seems like the problem is create_new_contigs, as both of those runs report that step failing. What's weird is that, as I mentioned, the error is inconsistent. I've had several non-multiploid runs succeed (and several fail), and only two multiploid runs succeed (with many failed runs). There's no real rhyme or reason to why they fail, either - I run the script the same way on the same data each time, and they fail at random. What's more frustrating (at least for me) is that there's no clear error this time around - nearly every single run has simply said create_new_contigs failed without saying why anywhere. The only failed run so far that gave any indication as to what really happened noted this:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

But that's the ONLY failed run that did that. Despite this inconsistency, though, it DOES seem like Hairsplitter is working start to finish for me now, which is incredibly exciting!
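Since std::bad_alloc generally just means a memory allocation failed, a possible diagnostic (my own rough sketch, nothing from Hairsplitter) would be to wrap a run and log the peak memory of its child processes, to see whether the random create_new_contigs failures line up with memory pressure on the node:

import resource
import subprocess

# My own diagnostic sketch (nothing from Hairsplitter): run the pipeline and report the
# peak memory of its child processes, since std::bad_alloc means an allocation failed.
cmd = [
    "python", "/users/PAS1802/woodruff207/Hairsplitter/hairsplitter.py",
    "-f", "../raw_reads/CC16-25kbmin.fastq",
    "-i", "assembly_graph.gfa",
    "-x", "ont",
    "-o", "../8_hairsplitter-selfrun",
    "-t", "28",
]
result = subprocess.run(cmd)
peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss   # kilobytes on Linux
print("exit code:", result.returncode, "- peak child RSS: about", round(peak_kb / 1e6, 1), "GB")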


Lastly, I'm not sure the -m (multiploid) argument is working as expected. The non-multiploid assembly (i.e. without -m) is beautifully connected and looks incredibly similar to the graphs you've shown in your poster and presentation! However, the multiploid assembly (i.e. WITH -m) is fragmented like I had shown in Issue #4, which feels odd if the multiploid argument is supposed to allow stronger assumptions and decrease fragmentation (though I know you mentioned you had not tested it much yet).

I'm not certain it's a BIG issue, considering default Hairsplitter seems to be doing a great job reassembling the chromosomes now, but it is something to be aware of, I guess. Here's an example:

[image]

FrostFlow13 commented 1 year ago

UPDATE2

Actually, additional update - the previous post was technically on a different dataset (it's a different strain/older dataset from another lab and could be run very quickly). I tried this again on the main dataset I've been trying to analyze, and here's what I got:


Non-multiploid:

#!/bin/bash
#SBATCH --time=05:00:00
#SBATCH --nodes=1 --ntasks=1 --cpus-per-task=28
#SBATCH --account=PAS1802
#SBATCH --job-name=1376_hairsplitter-25kb-keephap-2_run2
#SBATCH --export=ALL
#SBATCH --output=1376_hairsplitter-25kb-keephap-2_run2.out.%j
module load cmake/3.25.2
module load gnu/11.2.0
source /users/PAS1802/woodruff207/miniconda3/bin/activate
conda activate hairsplitter_env
cd /fs/ess/PAS1802/ALW/2023_06_15-MAY1376_TLOKOs_LongRead/1376/2_flye_assembly-keephap-25kb-2/
python /users/PAS1802/woodruff207/Hairsplitter/hairsplitter.py -f ../1_demul_adtrim/BC15-25kbmin.fastq -i 1376-25kb-kh-2-assem_garph-telotrim-rename.gfa -x ont -o ../8_hairsplitter_run2 -t 28

Resulted in:

===== STAGE 6: Creating all the new contigs   [ 2023-08-24 13:50:53.435182 ]

 This can take time, as we need to polish every new contig using Racon
 Running :  /users/PAS1802/woodruff207/Hairsplitter/src/build/create_new_contigs ../8_hairsplitter_run2/tmp/cut_assembly.gfa ../1_demul_adtrim/BC15-25kbmin.fastq 0.0123806 ../8_hairsplitter_run2/tmp/reads_haplo.gro ../8_hairsplitter_run2/tmp 28 ont ../8_hairsplitter_run2/tmp/zipped_assembly.gfa ../8_hairsplitter_run2/tmp/reads_on_new_contig.gaf 0 minimap2 racon 0
ERROR: create_new_contigs failed. Was trying to run: /users/PAS1802/woodruff207/Hairsplitter/src/build/create_new_contigs ../8_hairsplitter_run2/tmp/cut_assembly.gfa ../1_demul_adtrim/BC15-25kbmin.fastq 0.0123806 ../8_hairsplitter_run2/tmp/reads_haplo.gro ../8_hairsplitter_run2/tmp 28 ont ../8_hairsplitter_run2/tmp/zipped_assembly.gfa ../8_hairsplitter_run2/tmp/reads_on_new_contig.gaf 0 minimap2 racon 0

Multiploid:

#!/bin/bash
#SBATCH --time=05:00:00
#SBATCH --nodes=1 --ntasks=1 --cpus-per-task=28
#SBATCH --account=PAS1802
#SBATCH --job-name=1376_hairsplitter-25kb-keephap-2-multiploid_run2
#SBATCH --export=ALL
#SBATCH --output=1376_hairsplitter-25kb-keephap-2-multiploid_run2.out.%j
module load cmake/3.25.2
module load gnu/11.2.0
source /users/PAS1802/woodruff207/miniconda3/bin/activate
conda activate hairsplitter_env
cd /fs/ess/PAS1802/ALW/2023_06_15-MAY1376_TLOKOs_LongRead/1376/2_flye_assembly-keephap-25kb-2/
python /users/PAS1802/woodruff207/Hairsplitter/hairsplitter.py -f ../1_demul_adtrim/BC15-25kbmin.fastq -i 1376-25kb-kh-2-assem_garph-telotrim-rename.gfa -x ont -o ../8_hairsplitter-multiploid_run2 -m -t 28

Resulted in:


 - Running GraphUnzip with command line:
      python /users/PAS1802/woodruff207/Hairsplitter/src/GraphUnzip/graphunzip.py unzip -l ../8_hairsplitter-multiploid_run2/tmp/reads_on_new_contig.gaf -g ../8_hairsplitter-multiploid_run2/tmp/zipped_assembly.gfa -o ../8_hairsplitter-multiploid_run2/hairsplitter_final_assembly.gfa 2>../8_hairsplitter-multiploid_run2/tmp/logGraphUnzip.txt >../8_hairsplitter-multiploid_run2/tmp/trash.txt 
   The log of GraphUnzip is written on  ../8_hairsplitter-multiploid_run2/tmp/logGraphUnzip.txt

ERROR: GraphUnzip failed. Please check the output of GraphUnzip in ../8_hairsplitter-multiploid_run2/tmp/logGraphUnzip.txt

The logGraphUnzip.txt file is as follows:

Traceback (most recent call last):
  File "/users/PAS1802/woodruff207/Hairsplitter/src/GraphUnzip/graphunzip.py", line 436, in <module>
    main()
  File "/users/PAS1802/woodruff207/Hairsplitter/src/GraphUnzip/graphunzip.py", line 388, in main
    segments = simple_unzip(segments, names, lrFile)
  File "/users/PAS1802/woodruff207/Hairsplitter/src/GraphUnzip/simple_unzip.py", line 253, in simple_unzip
    if best_pair_for_each_left_link[p[0]][0] < pairs[p] :
IndexError: list index out of range

Looks like this isn't the same issue I originally had that prompted this Issue #6, but it still hits something during GraphUnzip, AND I'm also seeing the create_new_contigs issue I mentioned in UPDATE1. I'll try setting up two more runs to see if they go through. When it works, it works very VERY well, but there seems to be some sort of inconsistency issue right now.

As you requested for your troubleshooting purposes, I've uploaded my assembly 1376-25kb-kh-2-assem_garph-telotrim-rename.gfa and my sequences BC15-25kbmin.fastq to the same OneDrive I've shared with you previously.

I have also uploaded assembly_graph.gfa and CC16-25kbmin.fastq, which are the other datasets I mentioned in UPDATE1 that ran much, much faster (within minutes, as opposed to the 2+ hours for my main dataset), just in case you want a set of data that doesn't take as long to process. They might not be exactly equivalent for testing, though, as the CC16 dataset is from an ONT run in 2019 and the newer BC15 dataset is from an ONT run earlier this year, and I know with certainty that the CC16 dataset has a significantly higher error rate.


I checked the two runs in the morning - the non-multiploid finished this time, and the multiploid failed at the GraphUnzip step again with the same error. The zipped_assembly.gfa file still shows 0X depth for all contigs, but that's still not that much of an issue (I think).

Apart from these errors, the non-multiploid phased assembly of my main dataset looks very good! Some of the phased regions are even 1 Mb or more, which is honestly more than I was ever expecting! There are a few regions that seem a bit problematic when I check the sequence and see how the long reads align against the output and against previous assemblies, but those are regions that I knew might cause some issues. For example: some very repetitive (and long) regions definitely struggled (unsurprisingly - there's only so much a program can do to get through them, especially when there are other, very similar sequences from other chromosomes that could confuse it); some regions are duplicated across some of the subtelomeres; and some regions have both heterozygous inversions AND heterozygous deletions inside the inversions (for one of those complex hetinv+hetdel regions, it looks like Hairsplitter properly identified where the inversion AND the heterozygous deletion happened, but just dropped the sequence completely - I can see the supercontigs it made for them, but they're only 0-2 bp long).

RolandFaure commented 12 months ago

Hi, thank you for your feedback. This is a little bit strange. I ran both of your datasets (the CC16 several times and the BC15 once) and I did not run into any problems. Of course, if the problem occurs randomly, this is no guarantee. Could you check that you are running the version currently uploaded on the master branch? If you are and GraphUnzip still fails, could you provide the files that GraphUnzip takes as input (../8_hairsplitter-multiploid_run2/tmp/reads_on_new_contig.gaf and ../8_hairsplitter-multiploid_run2/tmp/zipped_assembly.gfa)?

FrostFlow13 commented 12 months ago

I haven't had a chance to test it yet today, but I did check a few of the files to verify that I am using the version from the master branch. From what I can see, this should be the most up-to-date version, as the files contain all of the changes I can see in commit d9b9b42 (which, from what I understand, should be the most recent one). Looking at modification times, it also looks like I ran the git clone https://github.com/RolandFaure/Hairsplitter.git command several hours after you had pushed commit d9b9b42, so I feel fairly confident that the version I pulled is up-to-date.
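For completeness, a simpler way to confirm this (a trivial sketch using the install path from my job scripts) is just to ask git which commit the clone is sitting on:

import subprocess

# Print the commit the local Hairsplitter clone is checked out at, to compare with d9b9b42.
repo = "/users/PAS1802/woodruff207/Hairsplitter"
head = subprocess.run(["git", "-C", repo, "rev-parse", "--short", "HEAD"],
                      capture_output=True, text=True, check=True)
print("local HEAD:", head.stdout.strip())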

I'll try running it a few more times to see if it has issues again, and if it does I'll go ahead and send you the requested files.

FrostFlow13 commented 10 months ago

Sorry for never following up on this! While Hairsplitter was incredibly useful at the time, I found an alternative method for regenerating my haplotypes that worked for my dataset: calling variant positions, phasing the calls, then tagging the reads by haplotype and splitting them into two read groups, after which I generated two new assemblies.
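In case it helps anyone else who finds this issue, the read-splitting step of that workflow only takes a few lines once the reads carry haplotype tags. Here's a rough sketch of the approach (assuming a haplotagged BAM with HP:i:1/HP:i:2 tags, e.g. added by a phasing tool, and using pysam; file names are placeholders):

import pysam

# Rough sketch of the read-splitting step (file names are placeholders): take a BAM whose
# reads carry HP:i:1 / HP:i:2 haplotype tags (e.g. added by a phasing tool) and write one
# read-name list per haplotype.
with pysam.AlignmentFile("haplotagged.bam", "rb") as bam, \
        open("hap1_read_names.txt", "w") as hap1, \
        open("hap2_read_names.txt", "w") as hap2:
    for read in bam:
        if not read.has_tag("HP"):
            continue   # untagged (unphased) reads: handle separately
        (hap1 if read.get_tag("HP") == 1 else hap2).write(read.query_name + "\n")
# Each name list can then be used to pull the raw reads out of the FASTQ
# (for example with seqkit grep -f) before assembling each haplotype separately.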

You can go ahead and close this if you want, or leave it open in case someone else ever has the error as well.

RolandFaure commented 10 months ago

Ok, no problem. I will try in the coming months to create a HairSplitter version tailored for diploid and polyploid organisms. For now, HairSplitter does not make any assumptions about coverage, even though these could be powerful assumptions if used correctly. Stay tuned!