PacificBiosciences / pb-falcon-phase

FALCON-Phase integrates PacBio long-read assemblies with Phase Genomics Hi-C data to create phased, diploid, chromosome-scale scaffolds

Is it possible to run with an IPA draft assembly? #7

Open ddelgadillod opened 3 years ago

ddelgadillod commented 3 years ago

Hi

I'm working on a whole-genome assembly of Solanum tuberosum (ploidy 4) with PacBio long reads and Hi-C data. I have a first draft contig assembly generated with pbIPA and another generated with HiCanu. Is it possible in any way to run FALCON-Phase with one of these draft assemblies?

Cheers,

Diego D

isovic commented 3 years ago

Hi Diego,

It's definitely possible to run Falcon-Phase with IPA output! The only thing that needs to be done is to adjust the contig headers in the FASTA files to match the Falcon-Phase requirements.

IPA contains a tool for this, try running the following on the contigs in the 19-final folder:

falconc ipa2-to-falcon-unzip --input-p-fn final.p_ctg.fasta --input-a-fn final.a_ctg.fasta --output-prefix final.renamed

Then use the final.renamed* files as input to Falcon-Phase.

Also tagging @zeeev for additional information if needed.

Best regards, Ivan.

ddelgadillod commented 3 years ago

Hi

I tried running the suggested tool, falconc ipa2-to-falcon-unzip, with the falconc installed alongside pb-IPA in a conda environment. Unfortunately, that falconc does not include the ipa2-to-falcon-unzip subcommand. Is it included in another release?

Regards

Diego

brantfaircloth commented 3 years ago

Just following up because I had this same question - the ipa2-to-falcon-unzip subcommand is in a newer version of falconc than seems to come with pbipa. I simply created a new environment with just this version of falconc in it:

conda create -n falconc pb-falconc
conda activate falconc
# check version
falconc version
Copyright (C) 2004-2021     Pacific Biosciences of California, Inc.
This program comes with ABSOLUTELY NO WARRANTY; it is intended for
Research Use Only and not for use in diagnostic procedures.

falconc version=1.13.1+git.f9d1b5651e891efe379bd9727a0fa0931b875d7b, nim-version=1.5.1

falconc ipa2-to-falcon-unzip <= now works

brantfaircloth commented 3 years ago

Following up on my comment above (and tagging @skingan and @zeeev) - it seems that the ipa2-to-falcon-unzip subcommand in the bioconda version of falconc (from the pb-falconc package) makes some corrections to the IPA header output, but does not make the headers equivalent to those of falcon-unzip. Specifically, the alternate/associate contigs are renamed, e.g.:

>hap_ctg.000028F OVLP renamed to >000028F

But this does not include the numerical count of the haplotig as in falconc - for example, it seems like the renamed result should be:

>000028F_001

This appears to cause a subsequent failure of preprocess_diploid_asm_for_fc_phase.py when preparing the name_mappings.txt file for falcon-phase from the IPA outputs converted to falcon-unzip format.

It might be pretty trivial to write some code to prepare the name_mappings.txt file from the converted files and their headers. However, in the associate contigs file from the original IPA assembly, the difference between a header like >hap_ctg.000028F OVLP and one like >ctg.000035F-045-01 LN:i:659606 RC:i:835 XC:f:1.000000 is a little unclear to me, so I'm hesitant to write that code without a little more information.

Perhaps it's as simple as:

>hap_ctg.000028F OVLP should be renamed to >000028F_001
>ctg.000035F-045-01 LN:i:659606 RC:i:835 XC:f:1.000000 should be renamed to >000035F_045
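The renaming guessed at above could be sketched in Python roughly as follows. This is only an illustration of the hypothesis in this comment, not a confirmed FALCON-Phase requirement; the function name guess_falcon_unzip_name and the count parameter are made up for the example.

```python
import re

# Hypothetical helper: encodes the renaming *guessed* in this thread.
# It has not been confirmed as what FALCON-Phase expects.
_UNZIPPED = re.compile(r"^ctg\.(\w+)-(\d+)-(\d+)$")  # ctg.<contig_id>-<bubble_id>-<branch_id>
_PURGED = re.compile(r"^hap_ctg\.(\w+)$")            # hap_ctg.<contig_id>


def guess_falcon_unzip_name(header: str, count: int = 1) -> str:
    """Map an IPA a_ctg header to a falcon-unzip style haplotig name."""
    # Keep only the sequence name: drop '>' and any tags after whitespace.
    name = header.lstrip(">").split()[0]
    match = _UNZIPPED.match(name)
    if match:
        contig_id, bubble_id, _branch_id = match.groups()
        return f"{contig_id}_{bubble_id}"
    match = _PURGED.match(name)
    if match:
        # 'count' is an assumed running number for this primary contig.
        return f"{match.group(1)}_{count:03d}"
    raise ValueError(f"unrecognised IPA header: {header!r}")


print(guess_falcon_unzip_name(">hap_ctg.000028F OVLP"))
print(guess_falcon_unzip_name(">ctg.000035F-045-01 LN:i:659606 RC:i:835 XC:f:1.000000"))
```

Headers that match neither pattern raise an error rather than being passed through, so unexpected naming schemes are noticed early.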

RenzoTale88 commented 2 years ago

@ddelgadillod this would be useful for me too. I've just assembled a mammalian-sized genome using IPA v1.3.0 and obtained both the primary and alternative haplotypes. I've renamed them as suggested above, but I've come across the same issue described by brantfaircloth. I find essentially three types of naming, shown below:

hap_ctg.000103F 17724   219029042       17724   17725
hap_ctg.6330    99243   218858967       99243   99244
ctg.000050F-090-01      35827   2053709557      35827   35828

As suggested above, I suspect that the output naming should be:

000103F_001
6330_001
000050F_090

Is this correct? Or is there something else to consider?

Thank you in advance for the help,

Andrea

RenzoTale88 commented 2 years ago

@isovic since I would like to use FALCON-Phase for the downstream phasing of the assembly, and falconc ipa2-to-falcon-unzip produces contigs with invalid names, I've tried to use the naming description added to the main page of pbipa. However, I'd like to know whether I'm proceeding correctly before messing with the haplotigs and hampering the FALCON-Phase workflow.

In a nutshell, 19-final/final.a_ctg.fasta contains two types of header:

  1. Haplotigs identified by the unzipping stage, labelled ctg.<contig_id>-<bubble_id>-<branch_id>
  2. Contigs moved to the haplotig folder, renamed hap_ctg.<contig_id>

The first group needs to be checked and modified as follows:

  1. If <contig_id> is present in the primary assembly, then the haplotig can simply be renamed to <contig_id>_<bubble_id>
  2. If the original contig has been moved to the haplotigs by purge_dups, then the haplotig is discarded

For the second group (contigs moved to haplotigs), the situation is different. First, if a haplotig has a corresponding entry in the primary assembly, it is kept as is, with a number appended as before (contig_id_00N, with N the usual progressive number). If a contig is not found in the primary assembly, I look for more details in the file 18-purge_dups/prim.dups.bed, which gives the classification of each haplotig and, where available, its corresponding contig in the primary assembly:

ctg.000247F     0       760448  HAPLOTIG        ctg.000144F 
ctg.000679F     1       18026   JUNK
ctg.000409F     138193  360386  OVLP
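Bed lines like the ones above could be read into records along these lines. This is a minimal sketch that assumes whitespace-separated columns in the order shown (contig, start, end, classification, optional matching contig); the names DupRecord and parse_dups_bed are made up for illustration.

```python
from typing import Iterable, List, NamedTuple, Optional


class DupRecord(NamedTuple):
    contig: str             # purged contig, e.g. ctg.000247F
    start: int
    end: int
    kind: str               # e.g. HAPLOTIG, JUNK, OVLP
    partner: Optional[str]  # matching primary contig, when one is listed


def parse_dups_bed(lines: Iterable[str]) -> List[DupRecord]:
    """Parse bed lines, assuming the column layout shown in the sample."""
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        partner = fields[4] if len(fields) > 4 else None
        records.append(
            DupRecord(fields[0], int(fields[1]), int(fields[2]), fields[3], partner)
        )
    return records


sample = [
    "ctg.000247F     0       760448  HAPLOTIG        ctg.000144F ",
    "ctg.000679F     1       18026   JUNK",
    "ctg.000409F     138193  360386  OVLP",
]
for record in parse_dups_bed(sample):
    print(record)
```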

In short, the purged contigs are processed as follows:

  1. Haplotigs with a corresponding entry in the primary assembly are expanded to 000001F_00<N>, with N again a progressive number continuing from the previous numbering.
  2. If a haplotig has no matching ID in the primary assembly but has a matching contig (as in the first entry of the bed above), it gets renamed from 000247F to 000144F_00<N>, with N again a progressive number continuing from the previous numbering.
  3. If a haplotig is not present in the primary assembly and is classified as JUNK/HIGHCOV, it is discarded, since it is not a haplotig.
  4. If a haplotig has overlaps but they are ambiguous (i.e. matching multiple primary contigs), it can't be placed correctly, so I discard it.
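The four rules above could be expressed as a single decision function, sketched below under some assumptions: IDs are passed without the ctg./hap_ctg. prefix, the matching contigs from the bed file are supplied as a list, and a per-contig counter dictionary stands in for the "previous numbering". The name assign_purged_name and its parameters are hypothetical.

```python
from typing import Dict, List, Optional, Set


def assign_purged_name(
    contig_id: str,
    primary_ids: Set[str],
    kind: str,
    partners: List[str],
    counters: Dict[str, int],
) -> Optional[str]:
    """Return a new haplotig name, or None if the contig should be discarded."""
    if contig_id in primary_ids:
        target = contig_id                    # rule 1: same ID exists in primary
    elif kind in ("JUNK", "HIGHCOV"):
        return None                           # rule 3: not a haplotig
    else:
        matches = [p for p in partners if p in primary_ids]
        if len(matches) != 1:
            return None                       # rule 4: no match or ambiguous
        target = matches[0]                   # rule 2: rename after the match
    # Continue the numbering already used for this primary contig.
    counters[target] = counters.get(target, 0) + 1
    return f"{target}_{counters[target]:03d}"


primary = {"000001F", "000144F"}
counters = {"000001F": 2}  # assumed: two haplotigs already numbered for 000001F
print(assign_purged_name("000001F", primary, "OVLP", [], counters))
print(assign_purged_name("000247F", primary, "HAPLOTIG", ["000144F"], counters))
print(assign_purged_name("000679F", primary, "JUNK", [], counters))
```

Keeping the counters outside the function makes it possible to seed them with the numbers already used by the unzipped haplotigs.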

Am I proceeding correctly? Are there any cases I'm missing?

Thank you,

Andrea

EDIT: I also forgot to say that some contigs come with an ID ending in R (e.g. 000234R), whose meaning I'm unsure of, or with a plain numerical ID (e.g. 5687). In the first case, I simply extract the number and append an F; if the resulting ID is not a duplicate, the contig is renamed to it, otherwise I add 100000 to the number to make it unique. The second case is handled the same way, except that there is no R suffix to strip.
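The ID handling described in this edit might look something like the sketch below. It makes two assumptions beyond the description: the original digit width is preserved, and 100000 is added repeatedly if the first offset still collides. The function name normalize_id is made up for illustration.

```python
from typing import Set


def normalize_id(contig_id: str, used_ids: Set[str]) -> str:
    """Turn an ID like 000234R or 5687 into a unique ID ending in F."""
    # Strip a trailing R if present; plain numeric IDs have nothing to strip.
    digits = contig_id[:-1] if contig_id.endswith("R") else contig_id
    if not digits.isdigit():
        raise ValueError(f"unexpected contig ID: {contig_id!r}")
    number = int(digits)
    candidate = f"{digits}F"
    while candidate in used_ids:
        # Assumption: keep adding 100000 until the ID no longer collides.
        number += 100000
        candidate = f"{str(number).zfill(len(digits))}F"
    used_ids.add(candidate)
    return candidate


used = {"000234F"}  # IDs already present in the assembly
print(normalize_id("000234R", used))
print(normalize_id("5687", used))
```

Recording each new ID in used_ids means later contigs are checked against renamed ones as well as the originals.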