WeichenZhou / PALMER

Pre-mAsking Long reads for Mobile Element inseRtion
MIT License
12 stars 5 forks source link

some fields in the output are not clear #24

Open crankycrank opened 1 year ago

crankycrank commented 1 year ago

Hi Could you add some documentation regarding what some of the output columns mean, in particular, what start1, start2, end1, end2 are? Is it possible to compute the breakpoints in the reference genome from them? Thank you

WeichenZhou commented 1 year ago

Hi @crankycrank

Thank you for using PALMER.

Yes, I am thinking of updating PALMER to generate better readable output. For start1, start2, end1, and end2, they are the confident intervals for the start (from start1 to start2) and end (from end1 to end2) of the insertion site in the reference genome. Ideally, they should be equal (and yes for most of the insertions in the output). However, for some kinds of long reads with high sequencing error or in some low confident regions that reads align to, these values would be different. Specifically, start1 should be the most left position in the reference genome and end2 should be the most right position in the reference genome for the insertion instance. Therefore, for the breakpoints of the insertion, you can calculate them from these four values. Since most of the time they are the same, that would be one position for the insertion site which is also the breakpoint site. If they are discordant in the coordinate, I would suggest using a mean value of them or a mean of start1 and start2 as the most confident insertion position.

Let me know if you have any further questions. Thanks!

Keltonc commented 1 year ago

Hi,

I'm also confused with the meaning of some of the fields, for example, start1_inVariant start2_inVariant end1_inVariant end2_inVariant. From the example sample_calls.txt, I can see that all clusters having Confident_supporting_reads and Potential_supporting_reads 0 and 1 respectively. Does this mean all the results are false positive based on your suggestion?

I'm using PALMER to find NUMT but it's hard for me to know whether I'm getting the real results. Therefore, I tried to extract the mapped bam file and look at each "potential" hits and it seems nothing in those positions. Could you please advise me the results I got?

Thank you.

WeichenZhou commented 1 year ago

Hi @Keltonc,

For the first question, start1_inVariant start2_inVariant end1_inVariant end2_inVariant are the start and end positions with a confident range that PALMER predicts the insertion's segment in the MEI (or variant of interest) consensus sequence.

For the second question, not really. That depends on what kind of element you investigate. Confident_supporting_reads are the ones basically have the TSD supports. For example, if you want to check Numt, you have to close the TSD finder module so you will not get any Confident_supporting_reads.

For the third question, I'm not sure how you did the process for checking the sequence. But if you could provide me with a small bam sample in your case and all the results and log of the runs. I could check and troubleshoot it for you.

Best, Arthur

Keltonc commented 1 year ago

Hi @WeichenZhou,

I have tried the following code for detection of NUMT. PALMER --input $INPUT --workdir $WORKDIR --ref_ver other --output $OUTPUT --mode raw --ref_fa $REF --type CUSTOMIZED --custom_seq custom_numt_seq.fasta --TSD_finding FALSE >& palmer.subset.log

The $INPUT file is mapping the pacbio long reads to the reference genome and custom_numt_seq.fasta is just an empty file because we don't know any numt present.

Sorry I can't share the bam file as it is too big but the results and the log file are attached.

cluster_id  read_name.info  5'_TSD  3'_TSD  Predicted_transD    Unique_26mer_at_5'junction  Whole_insertion_seq
cluster0_scaffold20_4623723_4623723_4623723_4623723 m151204_122126_42272_c100937162550000001823213006101632_s1_p0/19663/10179_24616_4619068_202.80.105.9687.9713.4623723.4623723.scaffold20.+.0.0.0.0.0.0.0 N   N   N   N   GCTTCTTAAATAATAAATTATTACATA
cluster0_scaffold23_1700445_1700445_1700445_1700445 m151212_093905_42272_c100936822550000001823213006101671_s1_p0/41123/8031_32362_1687169_196.33.105.17798.17872.1700445.1700445.scaffold23.+.0.0.0.0.0.0.0    N   N   N   N   TAGTTGAAGATTATACTGGCAGAAATTCAGATGACGTTTAGATGAGAAAGCTTCTCAATTAATAAACATTAAATA
cluster0_scaffold316_1016224_1016224_1016224_1016224    m160306_124529_42272_c100949922550000001823211206101643_s1_p0/144089/7888_20937_1016224_5.86.111.4545.4570.1016224.1016224.scaffold316.+.0.0.0.0.0.0.0  N   N   N   N   TAAATAATAAATAATATATAATAATA
cluster0_scaffold554_1091744_1091744_1091744_1091744    m160509_140728_42272_c100938752550000001823211006101672_s1_p0/126354/4366_19072_1091744_25.82.111.6094.6122.1091744.1091744.scaffold554.+.0.0.0.0.0.0.0 N   N   N   N   TTCTTAAATAAAAAACATTACAAATAATA
cluster0_scaffold1032_1049683_1049683_1049683_1049683   m160429_110352_42272_c100938492550000001823211006101621_s1_p0/97942/0_18600_1044744_17.85.111.10496.10522.1049683.1049683.scaffold1032.+.0.0.0.0.0.0.0  N   N   N   N   TTAAATAATAAATATTATATAATACTA
cluster0_scaffold1521_692664_692664_692664_692664   m160414_120845_42272_c100938722550000001823211006101605_s1_p0/131249/0_23511_682819_154.88.111.23012.23035.692664.692664.scaffold1521.-.0.0.0.0.0.0.0   N   N   N   N   TATTATTATGTATTATTTATTATT
cluster0_scaffold2718_110300_110300_110300_110300   m151213_180303_42272_c100936762550000001823213006101660_s1_p0/45850/0_17082_110300_50.88.111.5966.5989.110300.110300.scaffold2718.+.0.0.0.0.0.0.0   N   N   N   N   AATAATAACTAATACATAATAATA
cluster0_scaffold400_1006808_1006808_1006808_1006808    m160306_040645_42272_c100949922550000001823211206101641_s1_p0/129264/22348_32322_1000745_8.79.110.6841.6870.1006808.1006808.scaffold400.-.0.0.0.0.0.0.0 N   N   N   N   ATAATTATATAATTTATTATTTAAGAAGCT
cluster0_scaffold788_1045648_1045648_1045648_1045648    m151212_202400_42272_c100936822550000001823213006101673_s1_p0/114468/0_20383_1036850_8.60.98.11849.11886.1045648.1045648.scaffold788.-.0.0.0.0.0.0.0    N   N   N   N   TAATTATTATTTAAGAAGGTTTGTATTAAAAGGTATAT
cluster0_scaffold5007_16876_16876_16876_16876   m151212_052237_42272_c100936822550000001823213006101670_s1_p0/9031/0_9922_16876_9.87.109.8599.8621.16876.16876.scaffold5007.+.0.0.0.0.0.0.0 N   N   N   N   AAATAATTAATATTACATACTAA
cluster0_scaffold6665_1_1_1_1   m160906_114618_42272_c101094922550000001823234702241756_s1_p0/92343/0_1246_1_0.51.70.2.21.1.1.scaffold6665.+.0.0.0.0.0.0.0  N   N   N   N   CAGAGATCAATAGACGTTTT
cluster_id  chr start1  start2  end1    end2    start1_inVariant    start2_inVariant    end1_inVariant  end2_inVariant  Confident_supporting_reads  Potential_supporting_reads  Ptential_segmental_supporting_reads orientation polyA-tail_size 5'_TSD_size 3'_TSD_size Predicted_transD_size   Has_5'_inverted_sequence?   5'_inverted_seq_end 5'_seq_start
cluster0_scaffold20_4623723_4623723_4623723_4623723 scaffold20  4623723 4623723 4623723 4623723 80  80  105 105 0   1   0   +   -   0   0   0   0   0   0
cluster0_scaffold23_1700445_1700445_1700445_1700445 scaffold23  1700445 1700445 1700445 1700445 33  33  105 105 0   1   0   +   -   0   0   0   0   0   0
cluster0_scaffold316_1016224_1016224_1016224_1016224    scaffold316 1016224 1016224 1016224 1016224 86  86  111 111 0   1   0   +   -   0   0   0   0   0   0
cluster0_scaffold554_1091744_1091744_1091744_1091744    scaffold554 1091744 1091744 1091744 1091744 82  82  111 111 0   1   0   +   -   0   0   0   0   0   0
cluster0_scaffold1032_1049683_1049683_1049683_1049683   scaffold1032    1049683 1049683 1049683 1049683 85  85  111 111 0   1   0   +   -   0   0   0   0   0   0
cluster0_scaffold1521_692664_692664_692664_692664   scaffold1521    692664  692664  692664  692664  88  88  111 111 0   1   0   -   -   0   0   0   0   0   0
cluster0_scaffold2718_110300_110300_110300_110300   scaffold2718    110300  110300  110300  110300  88  88  111 111 0   1   0   +   -   0   0   0   0   0   0
cluster0_scaffold400_1006808_1006808_1006808_1006808    scaffold400 1006808 1006808 1006808 1006808 79  79  110 110 0   1   0   -   -   0   0   0   0   0   0
cluster0_scaffold788_1045648_1045648_1045648_1045648    scaffold788 1045648 1045648 1045648 1045648 60  60  98  98  0   1   0   -   -   0   0   0   0   0   0
cluster0_scaffold5007_16876_16876_16876_16876   scaffold5007    16876   16876   16876   16876   87  87  109 109 0   1   0   +   -   0   0   0   0   0   0
cluster0_scaffold6665_1_1_1_1   scaffold6665    1   1   1   1   51  51  70  70  0   1   0   +   -   0   0   0   0   0   0

palmer.subset.log

The potential_supporting_read is always 0. I am not sure how to interpret the tables. The way I check it is to extract the regions predicted by PALMER and visualize them in IGV. For example, cluster0_scaffold23_1700445_1700445_1700445_1700445 - An insertion in position 1700445 of scaffold 23 is reported

image

I don't see any read supported the insertion. The only thing I can tell is there is a supplementary alignment to the mtDNA. But the length is definitely different from the seq in *.TSD_reads.txt. Maybe I just interpret the whole result in a wrong way.

Looking forward to hearing from you back. Thank you.

WeichenZhou commented 1 year ago

Hi @Keltonc ,

Sorry for the delayed reply. I am swamped with projects. You can send me emails for a quicker response if you like.

For your issue, I don't think it's wise to have an empty custom_numt_seq.fasta here, which I can see is highly possible to cause the problem. Could you please try to assign the fasta sequence to PALMER and test again?

Best, Arthur