cschin / peregrine-2021


Output details #7

Open RenzoTale88 opened 2 years ago

RenzoTale88 commented 2 years ago

Hello, I've just run peregrine on a set of HiFi reads for a large mammalian genome. The run finished successfully, but I can't find an explanation of the different output files it generates. In particular, I have the following outputs:

asm_ctgs_e.fa
asm_ctgs_e0.fa
asm_ctgs_m.fa
asm_ctgs_m_a.fa
asm_ctgs_m_p.fa
asm_ctgs_m_rel.dat

I'm assuming that asm_ctgs_m_p.fa is the primary assembly and asm_ctgs_m_a.fa is the assembly carrying the alternative alleles, but I'm unsure about the other files. Thank you in advance. Best regards, Andrea

RenzoTale88 commented 2 years ago

@cschin sorry to insist on this, but is there any update on the documentation concerning the outputs?

cschin commented 2 years ago

Here are some comments.

These two files are the initial assembly output:

asm_ctgs_m.fa  # main contigs
asm_ctgs_e0.fa # extra contigs (smaller, fragmented results due to erroneous reads or very complicated repeats)

Some of the contigs in asm_ctgs_e0.fa may be duplicated, so a de-duplication process filters the contigs in asm_ctgs_e0.fa to generate asm_ctgs_e.fa.

The contigs in asm_ctgs_m.fa go through a process that identifies homologous contigs between the two haplotypes of a diploid genome. The "primary contigs" are kept in asm_ctgs_m_p.fa and the "associated contigs" are kept in asm_ctgs_m_a.fa. The file asm_ctgs_m_rel.dat describes how the contigs in asm_ctgs_m_a.fa relate to the contigs in asm_ctgs_m_p.fa.
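
To get a feel for how the FASTA outputs described above differ, it can help to compare their contig counts and total sizes. The following is a minimal sketch, not part of peregrine itself: it assumes the files are uncompressed FASTA in the current directory, and the helper names (`read_fasta_lengths`, `n50`, `summarize`) are made up for illustration. It does not attempt to read asm_ctgs_m_rel.dat, since its layout is not described in this thread.

```python
from pathlib import Path


def read_fasta_lengths(path):
    """Return a list with the sequence length of each FASTA record."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length of the contig at which the sorted cumulative sum reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def summarize(path):
    """Contig count, total bases, and N50 for one FASTA file."""
    lengths = read_fasta_lengths(path)
    return {"contigs": len(lengths), "total_bp": sum(lengths), "n50": n50(lengths)}


if __name__ == "__main__":
    # File names as listed in this issue; missing files are skipped.
    for name in [
        "asm_ctgs_m.fa",
        "asm_ctgs_m_p.fa",
        "asm_ctgs_m_a.fa",
        "asm_ctgs_e0.fa",
        "asm_ctgs_e.fa",
    ]:
        fasta = Path(name)
        if fasta.exists():
            print(name, summarize(fasta))
```

Given the description above, one would expect asm_ctgs_e.fa to hold no more contigs than asm_ctgs_e0.fa, and the contigs of asm_ctgs_m.fa to be split between asm_ctgs_m_p.fa and asm_ctgs_m_a.fa, but that is an expectation to check against your own data rather than a guarantee.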