Open ddelgadillod opened 3 years ago
Hi Diego,
It's definitely possible to run Falcon-Phase with IPA output! The only thing that needs to be done is to adjust the contig headers in the FASTA files to match the Falcon-Phase requirements.
IPA contains a tool for this; try running the following on the contigs in the 19-final folder:
falconc ipa2-to-falcon-unzip --input-p-fn final.p_ctg.fasta --input-a-fn final.a_ctg.fasta --output-prefix final.renamed
Then use the final.renamed* files as input to Falcon-Phase.
Also tagging @zeeev for additional information if needed.
Best regards, Ivan.
Hi
I tried running the suggested tool, falconc ipa2-to-falcon-unzip, with the falconc that is installed with pb-IPA in a conda environment. Unfortunately, this falconc does not contain the ipa2-to-falcon-unzip subcommand. Is it included in another release?
Regards
Diego
Just following up because I had this same question: the ipa2-to-falcon-unzip subcommand is in a newer version of falconc than seems to come with pbipa. I simply created a new environment with just this version of falconc in it:
conda create -n falconc pb-falconc
conda activate falconc
# check version
falconc version
Copyright (C) 2004-2021 Pacific Biosciences of California, Inc.
This program comes with ABSOLUTELY NO WARRANTY; it is intended for
Research Use Only and not for use in diagnostic procedures.
falconc version=1.13.1+git.f9d1b5651e891efe379bd9727a0fa0931b875d7b, nim-version=1.5.1
falconc ipa2-to-falcon-unzip <= now works
Following up on my comment above (and tagging @skingan and @zeeev): it seems that the ipa2-to-falcon-unzip subcommand in the bioconda version of falconc (the one that comes in the pb-falconc package) makes some corrections to the IPA header output, but does not seem to make the headers equivalent to those of falcon-unzip. Specifically, the alternate/associate contigs are renamed, e.g.:
>hap_ctg.000028F OVLP
renamed to >000028F
But this does not include the numerical count of the haplotig as in falconc. For example, it seems like the renamed result should be:
>000028F_001
This appears to cause a subsequent failure of preprocess_diploid_asm_for_fc_phase.py when preparing the name_mappings.txt file for falcon-phase from the IPA outputs converted to falcon-unzip format.
It might be pretty trivial to write some code to prepare the name_mappings.txt file from the converted files and file headers, but the differences within the associate contigs file from the original IPA assembly are a little unclear to me, e.g. between >hap_ctg.000028F OVLP and a header like >ctg.000035F-045-01 LN:i:659606 RC:i:835 XC:f:1.000000, so I'm hesitant to write code to prepare a name_mappings.txt without a little more information.
Perhaps it's as simple as:
>hap_ctg.000028F OVLP should be renamed to >000028F_001
>ctg.000035F-045-01 LN:i:659606 RC:i:835 XC:f:1.000000 should be renamed to >000035F_045
@ddelgadillod this would be useful for me too. I've just assembled a mammalian-sized genome using IPA v1.3.0 and obtained both the primary and alternative haplotypes. I've renamed them as suggested above, but I run into the same issue described by brantfaircloth. I find essentially three types of naming, shown below:
hap_ctg.000103F 17724 219029042 17724 17725
hap_ctg.6330 99243 218858967 99243 99244
ctg.000050F-090-01 35827 2053709557 35827 35828
As suggested above, I suspect that the output naming should be:
000103F_001
6330_001
000050F_090
Is this correct? Or is there something else to consider?
Thank you in advance for the help Andrea
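The renaming proposed above can be sketched in a few lines of Python. This is only an illustration of the guess made in this thread, not a confirmed Falcon-Phase requirement: the two regular expressions, the function name proposed_name, and the hap_index parameter are all assumptions introduced here, and the haplotig counter is simply passed in by the caller.

```python
import re

# Header styles as quoted in this thread (assumed patterns, not an official spec).
BUBBLE_RE = re.compile(r"^ctg\.(?P<contig>[^-\s]+)-(?P<bubble>\d+)-(?P<branch>\d+)$")
HAP_RE = re.compile(r"^hap_ctg\.(?P<contig>\S+)$")


def proposed_name(header: str, hap_index: int = 1) -> str:
    """Return the falcon-unzip-style name guessed in this thread.

    hap_index is a hypothetical running counter for hap_ctg entries.
    """
    # Drop a leading '>' and any trailing tags or columns after the name.
    name = header.lstrip(">").split()[0]
    m = BUBBLE_RE.match(name)
    if m:
        # ctg.<contig_id>-<bubble_id>-<branch_id> -> <contig_id>_<bubble_id>
        return f"{m['contig']}_{m['bubble']}"
    m = HAP_RE.match(name)
    if m:
        # hap_ctg.<contig_id> -> <contig_id>_<zero-padded counter>
        return f"{m['contig']}_{hap_index:03d}"
    raise ValueError(f"unrecognized header: {header!r}")


if __name__ == "__main__":
    for h in (">hap_ctg.000028F OVLP",
              "hap_ctg.6330 99243 218858967 99243 99244",
              "ctg.000050F-090-01 35827 2053709557 35827 35828"):
        print(h.split()[0], "->", proposed_name(h))
```

Whether a fixed _001 suffix is enough, or whether the counter has to continue from the haplotigs already attached to the same primary contig, is exactly the open question in this thread, which is why the counter is left as a parameter.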
@isovic since I would like to use FALCON-phase for the downstream phasing of the assembly, and falconc ipa2-to-falcon-unzip produces contigs that have invalid names, I've tried to use the naming description added to the main page of pbipa. However, I'd like to know whether I'm proceeding correctly before messing with the haplotigs and hampering the FALCON-phase workflow.
In a nutshell, the 19-final/final.a_ctg.fasta file contains two types of header:
ctg.<contig_id>-<bubble_id>-<branch_id>
hap_ctg.<contig_id>
The first type needs to be checked and modified as follows: if <contig_id> is present in the primary assembly, then it can simply be renamed to <contig_id>_<bubble_id>.
For the second group (contigs to haplotigs), the situation needs to be treated differently. First, if a haplotig has a corresponding entry in the primary assembly, it is kept as is, with a number added as before (contig_id_00N, with N the usual progressive number). If a contig is not found in the primary assembly, then I look for more details in the file 18-purge_dups/prim.dups.bed, which provides the classification of the haplotigs and whether they have a corresponding contig in the primary assembly:
ctg.000247F 0 760448 HAPLOTIG ctg.000144F
ctg.000679F 1 18026 JUNK
ctg.000409F 138193 360386 OVLP
In short, the purged contigs are processed as follows:
000001F_00<N>, with N again a progressive number, following up on the previous numbering.
000247F to 000144F_00<N>, with N again a progressive number, following up on the previous numbering.
JUNK/HIGHCOV is discarded since it is not a haplotig.
Am I proceeding correctly? Are there any cases I'm missing?
Thank you Andrea
EDIT: I also forgot to say that some contigs come with an ID ending in R (e.g. 000234R), whose meaning I'm unsure of, or with a plain numerical ID (e.g. 5687). In the first case, I simply extract the number and then append an F; if the resulting ID is not duplicated, the contig gets renamed, otherwise I add 100000 to the number to make it novel. Same process for the second, without excluding the R suffix.
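For the lookup in 18-purge_dups/prim.dups.bed described above, a minimal Python sketch could look like the following. It assumes the whitespace-separated layout shown in the three example lines (contig, start, end, classification, and an optional matching primary contig); the names DupRecord, parse_dups_bed and keep_as_haplotig are hypothetical helpers introduced here, and only the JUNK/HIGHCOV discard rule from the comment above is implemented.

```python
from typing import Iterable, List, NamedTuple, Optional


class DupRecord(NamedTuple):
    contig: str             # purged contig, e.g. ctg.000247F
    start: int
    end: int
    kind: str               # classification, e.g. HAPLOTIG, JUNK, OVLP
    partner: Optional[str]  # matching primary contig, when one is listed


# Classes discarded in the comment above because they are not haplotigs.
DISCARD = {"JUNK", "HIGHCOV"}


def parse_dups_bed(lines: Iterable[str]) -> List[DupRecord]:
    """Parse lines laid out like the prim.dups.bed excerpt in this thread."""
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        partner = fields[4] if len(fields) > 4 else None
        records.append(
            DupRecord(fields[0], int(fields[1]), int(fields[2]), fields[3], partner)
        )
    return records


def keep_as_haplotig(rec: DupRecord) -> bool:
    """True unless the record is classified as JUNK or HIGHCOV."""
    return rec.kind not in DISCARD


if __name__ == "__main__":
    example = [
        "ctg.000247F 0 760448 HAPLOTIG ctg.000144F",
        "ctg.000679F 1 18026 JUNK",
        "ctg.000409F 138193 360386 OVLP",
    ]
    for rec in parse_dups_bed(example):
        print(rec.contig, rec.kind, rec.partner, keep_as_haplotig(rec))
```

The records kept this way still need the renaming step discussed above; the sketch only separates the entries that would be carried forward from the ones that would be dropped.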
Hi
I'm working on the whole-genome assembly of Solanum tuberosum (ploidy 4) with PacBio long reads and Hi-C data. I have a first draft contig assembly generated with pbIPA and another generated with HiCanu. Is it in any way possible to run Falcon-Phase with one of these draft assemblies?
Cheers,
Diego D