PacificBiosciences / FALCON

FALCON: experimental PacBio diploid assembler -- Out-of-date -- Please use a binary release: https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries
https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries
Other
205 stars 102 forks source link

Output the relative coordinates of the associated contigs on top of their primary contigs #83

Open pb-jchin opened 9 years ago

pb-jchin commented 9 years ago

I will use for some analysis to know where the associated contigs mapped to on the primary. The begin and the end of an associated contig can be fully specified by the nodes in the string graph. We can scan through the tiling path of the primary to get the coordinates.

pb-jchin commented 9 years ago

temporary solution for users who need this today

One can run this simple script under 2-asm-falcon directory. It will print a list of a_tig_id, start_pos, end_pos to stdout. This simple code should be integrated into fc_graph_to_contig.py eventually.

from falcon_kit.FastaReader import FastaReader

p_ctg_coor_map = {}
with open("p_ctg_tiling_path") as f:
    for row in f:
        row = row.strip().split()
        ctg_id, v, w, edge_rid, b, e  = row[:6]
        if ctg_id not in p_ctg_coor_map:
            coor = 0   # the p_ctg_tiling_path should be sorted by contig the order of the edges in the tiling path
            p_ctg_coor_map[ctg_id] = {}
            p_ctg_coor_map[ctg_id][v] = 0
            coor += abs(int(b) - int(e))
            p_ctg_coor_map[ctg_id][w] = coor
            continue
        else:
            coor += abs(int(b) - int(e))
            p_ctg_coor_map[ctg_id][w] = coor

a_ctg_fasta = FastaReader("a_ctg.fa")
for r in a_ctg_fasta:
    rid = r.name.split()
    rid, v, w = rid[:3]
    pid = rid.split("-")[0]
    print rid, p_ctg_coor_map[pid][v], p_ctg_coor_map[pid][w]

The code is checked in diconsensus development branch as fc_actg_coordinate.py for now.

pb-jchin commented 9 years ago

It might be useful to document an alternative way to get related information as independent validation by running nucmer for individual contig. For example, this maps the associate contigs to the primary contigs using nucmer and dump the mapped coordinates. In case, one need to do proper parameter setting and filtering.

$ grep -A 1 000000F p_ctg.fa > p_ctg_000000F.fa
$ grep -A 1 000000F a_ctg.fa > a_ctg_000000F.fa
$ nucmer -mum p_ctg_000000F.fa a_ctg_000000F.fa -p nucmer_aln
$ show-coords -T -H -L 10000 nucmer_aln.delta | awk '{print $NF" "$1" "$2}'  | less

000000F-001-01 28503309 28567305
000000F-002-01 35510179 35552182
000000F-003-01 27329414 27363334
000000F-004-01 49369415 49410904
000000F-005-01 21818290 21846900
000000F-006-01 13887528 13903602
000000F-007-01 55731118 55759935
...
arangrhie commented 9 years ago

Hello, I've a simple question. Are the coordinates given by fc_graph_to_contig.py inclusive? Are they 0-based or 1-based? I want to extract the bases overlapped both in primary and associated contigs, and make a new fa file representing another haplotype of my genome. If my output of fc_graph_to_contig.py looks like the example above, ex. 000000F-001-01 28503309 28567305 do I need to extract bases of 000000F:1-28503308 and then add 000000F-001-01 again add bases of 000000F:28567306~35510178 and so on?

pb-jchin commented 9 years ago

The code used exclusively Python index slicing notation. (see the paragraph about "how slices work" https://docs.python.org/2/tutorial/introduction.html ) The coordinate output is 0-base, left inclusive, right non-inclusive

Here is an example from https://docs.python.org/2/tutorial/introduction.html

 +---+---+---+---+---+---+
 | P | y | t | h | o | n |
 +---+---+---+---+---+---+
 0   1   2   3   4   5   6
-6  -5  -4  -3  -2  -1

By the way, while in most case, an associated contig is fully from a different haplotype, but there is no absolute proof that is true for all cases. There will be some false positives. One thing has less ambiguity is that the large difference between the primary and associated contigs are indeed some big different between different haplotypes.

fc_graph_to_contig.py does not generate the interval of the associated contigs to the primary contigs. I think you mean the script above.

For the example above,

000000F-001-01 28503309 28567305

it means 000000F-001-01 mapped to 000000F "28503309: 28567305" in Python's notation.

Also, there is a output file called a_ctg_base.fa which already contains the information that you plan to generate.