Output explanantion - Hairsplitter

RolandFaure / Hairsplitter

Software that separates very close sequences that have been collapsed during assembly. Uses only long reads.

GNU General Public License v3.0

34 stars 0 forks source link

Output explanantion - Hairsplitter #7

Closed alexvasilikop closed 1 year ago

alexvasilikop commented 1 year ago

Hello Roland,

I was wondering how to interpret the output of Hairsplitter

 - Between positions 96000 and 101999 of the contig, I've created these contigs:
   - Scaffold_1@1_96000_1
   - Scaffold_1@1_96000_0
 - Between positions 102000 and 105999 of the contig, I've created these contigs:
   - Scaffold_1@1_102000_2
   - Scaffold_1@1_102000_1
   - Scaffold_1@1_102000_0

Does this mean that Hairsplitter outputs different ploidies for different parts of the original assembly? You can see here that both intervals are on the same original scaffold (Scaffold_1). It was not very clear from the documantation that the software computes ploidy separately for different parts of the original scaffold. Does this above mean that in the first interval it detects 2 haplotypes but in the second interval 3 haplotypes (possible aneuploidy)?

Is there a way to see which supercontigs are different alleles of other supercontigs in the final assembly?

Thanks Alex

RolandFaure commented 1 year ago

Hi Alex,

As you noticed, HairSplitter does not split all parts of a contig in the same number of haplotypes in the first step of the algorithm. This does NOT mean there is aneuploidy in the genome - in you case, it probably means that two of the three haplotypes are homozygous between position 96000 and 102000, or that the region between 102000 and 106000 is a repeated region that occurs at least three times in the original genome.

All these haplotypes are then untangled (~scaffolded) in the last step of HairSplitter. If you want more details, you can inspect the tmp/zipped_assembly.gfa. If you run HairSplitter with parameter -s (don't simplify), the resulting contigs will have names such as ..._Scaffold_1@1_96000_1_Scaffold_1@1_102000_2_... This means that the contig is the concatenation of Scaffold_1@1_96000_1 and Scaffold_1@1_102000_2. The contig containing Scaffold_1@1_96000_1 is homologous with the contig containing Scaffold_1@1_96000_0

I will try to make it clearer in the output of HairSplitter.

alexvasilikop commented 1 year ago

Thanks Roland, Is there a simple way to see which of the output scaffolded contigs are actually homologous sequences without having to look the names of subcontigs contained in each (maybe to be added in future versions?).

In my case, I am just interested to get an estimate of the ploidy of this genome by looking at the number of contigs / regions that have a specific ploidy level (2, 3, 4 alleles etc). For example if most supercontigs (or regions) have only 1 homolog, then the genome should be (mostly) diploid?

Why does the output contain more scaffolds than my input genome which is chromosome-level? Does Hairsplitter breaks scaffold connections based on coverage?

Thanks again Alex

RolandFaure commented 1 year ago

The best thing that comes to my mind now would be to align the obtained supercontigs on the original haploid assembly, and visualize using e.g. Tablet, or do it the other way around and visualize using Bandage.

alexvasilikop commented 1 year ago

Hi Roland,

I cannot find the parameter -r. Is this implemented in version 1.3.4?

python3 /mnt/sda1/Alex/software/Hairsplitter/hairsplitter.py -h
usage: hairsplitter.py [-h] -i ASSEMBLY -f FASTQ [-x TECHNOLOGY] [-m] [-t THREADS] -o OUTPUT [-s] [-p] [-F] [--path_to_minimap2 PATH_TO_MINIMAP2] [--path_to_racon PATH_TO_RACON]
                       [--path_to_samtools PATH_TO_SAMTOOLS] [-d] [-v]

options:
  -h, --help            show this help message and exit
  -i ASSEMBLY, --assembly ASSEMBLY
                        Original assembly in GFA or FASTA format (required)
  -f FASTQ, --fastq FASTQ
                        Sequencing reads fastq (required)
  -x TECHNOLOGY, --technology TECHNOLOGY
                        {ont, pacbio, hifi} [ont]
  -m, --multiploid      Use this option if all haplotypes can be assumed to have the same coverage
  -t THREADS, --threads THREADS
                        Number of threads [1]
  -o OUTPUT, --output OUTPUT
                        Output directory
  -s, --dont_simplify   Don't rename the contigs and don't merge them
  -p, --polish-everything
                        Polish every contig with racon, even those
  -F, --force           Force overwrite of output folder if it exists
  --path_to_minimap2 PATH_TO_MINIMAP2
                        Path to the executable minimap2 [minimap2]
  --path_to_racon PATH_TO_RACON
                        Path to the executable racon [racon]
  --path_to_samtools PATH_TO_SAMTOOLS
                        Path to samtools [samtools]
  -d, --debug           Debug mode
  -v, --version         Print version and exit

best alex

RolandFaure commented 1 year ago

Sorry, the parameter is -s (dont simplify)

alexvasilikop commented 1 year ago

Hello Roland,

it seems that in the output assembly after using the -r option there is no scaffolding performed (on the contrary, although the original assembly was 11 scaffolds, the ouput is 19634). But it might be because the original assembly was scaffolded with HiC. In any case the contigs look like this:

>Scaffold_1@11_6000_1-0
>Scaffold_1@11_6000_0-0
>Scaffold_1@11_58000_0-1
>Scaffold_1@11_58000_0-0
>Scaffold_1@11_54000_1-0
>Scaffold_1@11_54000_0-0
>Scaffold_1@11_4000_0-1
>Scaffold_1@11_4000_0-0
>Scaffold_1@11_262000_1-0
>Scaffold_1@11_262000_0-0
>Scaffold_1@11_260000_1-0
>Scaffold_1@11_260000_0-0
>Scaffold_1@11_18000_0-1
>Scaffold_1@11_18000_0-0
>Scaffold_1@11_178000_0-0
>Scaffold_1@11_176000_1-0
>Scaffold_1@11_176000_0-0
>Scaffold_1@11_16000_1-0
>Scaffold_1@11_16000_0-0
>Scaffold_1@11_12000_0-1
>Scaffold_1@11_12000_0-0

There is only one scaffold always in the contig name which shows the output is actually less contiguous than the input (since you have many contigs from the same input scaffold). I was wondering what the last numbers in the name mean. For example:

Scaffold_1@11_12000_0-0
Scaffold_1@11_12000_0-1

The last numbers correspond to alternative alleles (0,1)? Why is the difference sometimes changed to the last before the end digit?:

Scaffold_1@11_16000_1-0
Scaffold_1@11_16000_0-0

And what does 11_12000 stand for in the first case?

Thanks for your help Alex

alexvasilikop commented 1 year ago

I forgot to mention that by doing a quick modification like this of the headers, I get the number of potential alleles for the same region? Let meknow if I am wrong (it is based on the assumption that the last 2 numbers are different alleles.

grep ">" hairsplitter_final_assembly.fasta| sort -nr| cut -f1,2,3 -d "_"|sort |uniq -c

      2 >Scaffold_6@8_200000
      2 >Scaffold_6@8_202000
      2 >Scaffold_6@8_208000
      2 >Scaffold_6@8_210000
      2 >Scaffold_6@8_220000
      2 >Scaffold_6@8_222000
      2 >Scaffold_6@8_24000
      2 >Scaffold_6@8_242000
      2 >Scaffold_6@8_244000
      2 >Scaffold_6@8_26000
      2 >Scaffold_6@8_28000
      2 >Scaffold_6@8_30000
      2 >Scaffold_6@8_86000
      2 >Scaffold_6@8_88000
      2 >Scaffold_6@9_0
      2 >Scaffold_6@9_102000
      2 >Scaffold_6@9_104000
      2 >Scaffold_6@9_106000
      2 >Scaffold_6@9_108000
      2 >Scaffold_6@9_110000
      2 >Scaffold_6@9_112000
      2 >Scaffold_6@9_114000
      2 >Scaffold_6@9_116000
      2 >Scaffold_6@9_186000
      2 >Scaffold_6@9_188000
      2 >Scaffold_6@9_194000
      2 >Scaffold_6@9_196000
      2 >Scaffold_6@9_218000
      2 >Scaffold_6@9_220000
      2 >Scaffold_6@9_226000
      2 >Scaffold_6@9_228000
      2 >Scaffold_6@9_238000
      2 >Scaffold_6@9_240000
      2 >Scaffold_6@9_250000
      2 >Scaffold_6@9_252000
      3 >Scaffold_6@9_260000
      3 >Scaffold_6@9_262000
      3 >Scaffold_6@9_264000
      3 >Scaffold_6@9_266000
      3 >Scaffold_6@9_268000
      2 >Scaffold_6@9_276000
      2 >Scaffold_6@9_30000
      2 >Scaffold_6@9_32000
      2 >Scaffold_6@9_98000
      1 >Scaffold_7@0_0
      2 >Scaffold_7@0_290000
      1 >Scaffold_7@0_294000
      1 >Scaffold_7@1-0
      2 >Scaffold_8@0_0
      2 >Scaffold_8@0_126000
      2 >Scaffold_8@0_134000
      2 >Scaffold_8@0_136000
      1 >Scaffold_8@0_146000
      2 >Scaffold_8@0_2000
      2 >Scaffold_8@0_230000
      2 >Scaffold_8@0_232000
      2 >Scaffold_8@0_242000
      1 >Scaffold_8@0_244000
      2 >Scaffold_8@0_46000
      2 >Scaffold_8@0_54000
      3 >Scaffold_8@0_58000
      2 >Scaffold_8@0_60000
      2 >Scaffold_8@0_62000
      2 >Scaffold_8@0_66000
      2 >Scaffold_8@0_72000
      1 >Scaffold_8@0_74000
      1 >Scaffold_9@0_0
      2 >Scaffold_9@0_56000
      1 >Scaffold_9@0_58000

Best

RolandFaure commented 1 year ago

Hi,

When using the -s parameters, contigs are not merged, which explains why the contiguity is so low. Have a look in Bandage to understand it better. If you want to improve contiguity, you can use the "merge all possible nodes" option in Bandage. Note that, even like this, contiguity may decrease compared to the original assembly, as phasing is a difficult task.

Here is the explanation of the names of the contig

When the contigs are long, they are split in chunks of 100kbp to alleviate memory usage -> Scaffold_1 is split in Scaffold_1@0, Scaffold_1@1, Scaffold_1@2...
The next number indicates the coordinate of the leftmost base of the new contig compared to the reference: Scaffold_1@11_6000 indicates a contig homologous to the sequence at position 11*100000 + 6000 = 1106000 of Scaffold_1
The next number corresponds to the allele: Scaffold_1@11_6000_0 and Scaffold_1@11_6000_1 correspond to two different allleles.
The last number corresponds to homozygous alleles: Scaffold_1@11_12000_0-0 and Scaffold_1@11_12000_0-1 have the exact same sequence. They have been duplicated to scaffold properly the final assembly, because this region is present twice in the genome.

Best

alexvasilikop commented 1 year ago

Great thanks that's very helpful