Closed alexvasilikop closed 1 year ago
Hi Alex,
As you noticed, HairSplitter does not split all parts of a contig in the same number of haplotypes in the first step of the algorithm. This does NOT mean there is aneuploidy in the genome - in you case, it probably means that two of the three haplotypes are homozygous between position 96000 and 102000, or that the region between 102000 and 106000 is a repeated region that occurs at least three times in the original genome.
All these haplotypes are then untangled (~scaffolded) in the last step of HairSplitter. If you want more details, you can inspect the tmp/zipped_assembly.gfa. If you run HairSplitter with parameter -s
(don't simplify), the resulting contigs will have names such as ..._Scaffold_1@1_96000_1_Scaffold_1@1_102000_2_...
This means that the contig is the concatenation of Scaffold_1@1_96000_1
and Scaffold_1@1_102000_2
. The contig containing Scaffold_1@1_96000_1
is homologous with the contig containing Scaffold_1@1_96000_0
I will try to make it clearer in the output of HairSplitter.
Thanks Roland, Is there a simple way to see which of the output scaffolded contigs are actually homologous sequences without having to look the names of subcontigs contained in each (maybe to be added in future versions?).
In my case, I am just interested to get an estimate of the ploidy of this genome by looking at the number of contigs / regions that have a specific ploidy level (2, 3, 4 alleles etc). For example if most supercontigs (or regions) have only 1 homolog, then the genome should be (mostly) diploid?
Why does the output contain more scaffolds than my input genome which is chromosome-level? Does Hairsplitter breaks scaffold connections based on coverage?
Thanks again Alex
The best thing that comes to my mind now would be to align the obtained supercontigs on the original haploid assembly, and visualize using e.g. Tablet, or do it the other way around and visualize using Bandage.
Hi Roland,
I cannot find the parameter -r. Is this implemented in version 1.3.4?
python3 /mnt/sda1/Alex/software/Hairsplitter/hairsplitter.py -h
usage: hairsplitter.py [-h] -i ASSEMBLY -f FASTQ [-x TECHNOLOGY] [-m] [-t THREADS] -o OUTPUT [-s] [-p] [-F] [--path_to_minimap2 PATH_TO_MINIMAP2] [--path_to_racon PATH_TO_RACON]
[--path_to_samtools PATH_TO_SAMTOOLS] [-d] [-v]
options:
-h, --help show this help message and exit
-i ASSEMBLY, --assembly ASSEMBLY
Original assembly in GFA or FASTA format (required)
-f FASTQ, --fastq FASTQ
Sequencing reads fastq (required)
-x TECHNOLOGY, --technology TECHNOLOGY
{ont, pacbio, hifi} [ont]
-m, --multiploid Use this option if all haplotypes can be assumed to have the same coverage
-t THREADS, --threads THREADS
Number of threads [1]
-o OUTPUT, --output OUTPUT
Output directory
-s, --dont_simplify Don't rename the contigs and don't merge them
-p, --polish-everything
Polish every contig with racon, even those
-F, --force Force overwrite of output folder if it exists
--path_to_minimap2 PATH_TO_MINIMAP2
Path to the executable minimap2 [minimap2]
--path_to_racon PATH_TO_RACON
Path to the executable racon [racon]
--path_to_samtools PATH_TO_SAMTOOLS
Path to samtools [samtools]
-d, --debug Debug mode
-v, --version Print version and exit
best alex
Sorry, the parameter is -s
(dont simplify)
Hello Roland,
it seems that in the output assembly after using the -r option there is no scaffolding performed (on the contrary, although the original assembly was 11 scaffolds, the ouput is 19634). But it might be because the original assembly was scaffolded with HiC. In any case the contigs look like this:
>Scaffold_1@11_6000_1-0
>Scaffold_1@11_6000_0-0
>Scaffold_1@11_58000_0-1
>Scaffold_1@11_58000_0-0
>Scaffold_1@11_54000_1-0
>Scaffold_1@11_54000_0-0
>Scaffold_1@11_4000_0-1
>Scaffold_1@11_4000_0-0
>Scaffold_1@11_262000_1-0
>Scaffold_1@11_262000_0-0
>Scaffold_1@11_260000_1-0
>Scaffold_1@11_260000_0-0
>Scaffold_1@11_18000_0-1
>Scaffold_1@11_18000_0-0
>Scaffold_1@11_178000_0-0
>Scaffold_1@11_176000_1-0
>Scaffold_1@11_176000_0-0
>Scaffold_1@11_16000_1-0
>Scaffold_1@11_16000_0-0
>Scaffold_1@11_12000_0-1
>Scaffold_1@11_12000_0-0
There is only one scaffold always in the contig name which shows the output is actually less contiguous than the input (since you have many contigs from the same input scaffold). I was wondering what the last numbers in the name mean. For example:
Scaffold_1@11_12000_0-0
Scaffold_1@11_12000_0-1
The last numbers correspond to alternative alleles (0,1)? Why is the difference sometimes changed to the last before the end digit?:
Scaffold_1@11_16000_1-0
Scaffold_1@11_16000_0-0
And what does 11_12000 stand for in the first case?
Thanks for your help Alex
I forgot to mention that by doing a quick modification like this of the headers, I get the number of potential alleles for the same region? Let meknow if I am wrong (it is based on the assumption that the last 2 numbers are different alleles.
grep ">" hairsplitter_final_assembly.fasta| sort -nr| cut -f1,2,3 -d "_"|sort |uniq -c
2 >Scaffold_6@8_200000
2 >Scaffold_6@8_202000
2 >Scaffold_6@8_208000
2 >Scaffold_6@8_210000
2 >Scaffold_6@8_220000
2 >Scaffold_6@8_222000
2 >Scaffold_6@8_24000
2 >Scaffold_6@8_242000
2 >Scaffold_6@8_244000
2 >Scaffold_6@8_26000
2 >Scaffold_6@8_28000
2 >Scaffold_6@8_30000
2 >Scaffold_6@8_86000
2 >Scaffold_6@8_88000
2 >Scaffold_6@9_0
2 >Scaffold_6@9_102000
2 >Scaffold_6@9_104000
2 >Scaffold_6@9_106000
2 >Scaffold_6@9_108000
2 >Scaffold_6@9_110000
2 >Scaffold_6@9_112000
2 >Scaffold_6@9_114000
2 >Scaffold_6@9_116000
2 >Scaffold_6@9_186000
2 >Scaffold_6@9_188000
2 >Scaffold_6@9_194000
2 >Scaffold_6@9_196000
2 >Scaffold_6@9_218000
2 >Scaffold_6@9_220000
2 >Scaffold_6@9_226000
2 >Scaffold_6@9_228000
2 >Scaffold_6@9_238000
2 >Scaffold_6@9_240000
2 >Scaffold_6@9_250000
2 >Scaffold_6@9_252000
3 >Scaffold_6@9_260000
3 >Scaffold_6@9_262000
3 >Scaffold_6@9_264000
3 >Scaffold_6@9_266000
3 >Scaffold_6@9_268000
2 >Scaffold_6@9_276000
2 >Scaffold_6@9_30000
2 >Scaffold_6@9_32000
2 >Scaffold_6@9_98000
1 >Scaffold_7@0_0
2 >Scaffold_7@0_290000
1 >Scaffold_7@0_294000
1 >Scaffold_7@1-0
2 >Scaffold_8@0_0
2 >Scaffold_8@0_126000
2 >Scaffold_8@0_134000
2 >Scaffold_8@0_136000
1 >Scaffold_8@0_146000
2 >Scaffold_8@0_2000
2 >Scaffold_8@0_230000
2 >Scaffold_8@0_232000
2 >Scaffold_8@0_242000
1 >Scaffold_8@0_244000
2 >Scaffold_8@0_46000
2 >Scaffold_8@0_54000
3 >Scaffold_8@0_58000
2 >Scaffold_8@0_60000
2 >Scaffold_8@0_62000
2 >Scaffold_8@0_66000
2 >Scaffold_8@0_72000
1 >Scaffold_8@0_74000
1 >Scaffold_9@0_0
2 >Scaffold_9@0_56000
1 >Scaffold_9@0_58000
Best
Hi,
When using the -s
parameters, contigs are not merged, which explains why the contiguity is so low. Have a look in Bandage to understand it better. If you want to improve contiguity, you can use the "merge all possible nodes" option in Bandage. Note that, even like this, contiguity may decrease compared to the original assembly, as phasing is a difficult task.
Here is the explanation of the names of the contig
Best
Great thanks that's very helpful
Hello Roland,
I was wondering how to interpret the output of Hairsplitter
Does this mean that Hairsplitter outputs different ploidies for different parts of the original assembly? You can see here that both intervals are on the same original scaffold (Scaffold_1). It was not very clear from the documantation that the software computes ploidy separately for different parts of the original scaffold. Does this above mean that in the first interval it detects 2 haplotypes but in the second interval 3 haplotypes (possible aneuploidy)?
Is there a way to see which supercontigs are different alleles of other supercontigs in the final assembly?
Thanks Alex