faircloth-lab / phyluce

software for UCE (and general) phylogenomics
http://phyluce.readthedocs.org/

phyluce_probe_get_multi_merge_table failed to retrieve correctly conserved loci #185

Closed gushiro closed 3 years ago

gushiro commented 4 years ago

I have encountered an issue I do not quite understand. After filtering my BED files (as in the tutorial: https://phyluce.readthedocs.io/en/latest/tutorial-four.html), running "phyluce_probe_get_multi_merge_table", and grepping for the loci shared among my 23 species, I got this (only the top part is shown here as an example):

UCE,Contig,start,end,sp1,sp2,sp3,sp4,sp5,sp6,sp7,sp8,sp9,sp10,sp11,sp12,sp13,sp14,sp15,sp16,sp17,sp18,sp19,sp20,sp21,sp22,sp23
1693203,ScSUXzt_204,395658,395873,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
1693185,ScSUXzt_204,381334,381463,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
1692783,ScSUXzt_204,56102,56308,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
1692780,ScSUXzt_204,54924,55398,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
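
The grep step described above can also be sketched in Python. This is a minimal sketch, assuming the table has been exported as comma-separated text with the layout shown (four coordinate columns followed by one 0/1 column per species). The function name `shared_loci` and the third row are hypothetical and are only there to show a locus being filtered out.

```python
import csv
import io

N_SPECIES = 23
species = [f"sp{i}" for i in range(1, N_SPECIES + 1)]
header = ",".join(["UCE", "Contig", "start", "end"] + species)

# Flags for a locus detected in every species.
all_present = ",".join(["1"] * N_SPECIES)
# Hypothetical flags (not from the thread): the last species is missing.
one_missing = ",".join(["1"] * (N_SPECIES - 1) + ["0"])

table = "\n".join([
    header,
    f"1693203,ScSUXzt_204,395658,395873,{all_present}",
    f"1693185,ScSUXzt_204,381334,381463,{all_present}",
    # Hypothetical row, added only so the filter has something to drop.
    f"hypothetical_locus,ScSUXzt_204,1000,1200,{one_missing}",
])


def shared_loci(csv_text, species_columns):
    """Return the rows in which every species column is flagged '1'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader
            if all(row[sp] == "1" for sp in species_columns)]


shared = shared_loci(table, species)
for row in shared:
    # Length of the locus interval as recorded in the table.
    length = int(row["end"]) - int(row["start"])
    print(row["UCE"], row["Contig"], row["start"], row["end"], length)
```

Note that the table records one interval per locus for all species, so the length computed here is the same for every species regardless of how much of the interval each one actually covers.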

When I reconstructed my FASTA files, I realized that the outgroups have much shorter sequences than the species closer to the reference. I was expecting all conserved loci to have more or less similar lengths, except for some indels within the overlapping range. So I ran samtools tview (see below for "ScSUXzt_204:395658"):

For species 1: This species is closer to the reference; the reads start at position 395658, so they are within the range.

sp1 tview

For species 3: This species is also closer to the reference, and the reads start before position 395658, so they are also within the range.

sp3 tview

For species 23: This is the outgroup. The reads here start 17 base pairs after the start position. I was expecting to find reads starting either before or at position 395658.

sp23 tview

So my question is, why is this happening? Maybe there is a filter or a default value within phyluce_probe_get_multi_merge_table that I am missing. The same happens for other conserved loci, particularly in species that are not close to the reference genome.
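
The three tview observations can be compared numerically. This is a minimal sketch and not phyluce's implementation. It assumes, purely for illustration, that a species counts as present at a locus whenever its covered interval overlaps the locus interval at all. The function `overlap`, the end coordinates, and the sp3 start position are hypothetical, since the thread does not give them.

```python
def overlap(locus_start, locus_end, cov_start, cov_end):
    """Return the number of bases shared by two half-open intervals [start, end)."""
    return max(0, min(locus_end, cov_end) - max(locus_start, cov_start))


# Locus interval from the table row for UCE 1693203.
LOCUS_START, LOCUS_END = 395658, 395873

# Covered interval per species. Starts follow the tview descriptions;
# ends are ASSUMED to reach the locus end (not stated in the thread).
coverage = {
    "sp1": (395658, LOCUS_END),        # reads start at the locus start
    "sp3": (395640, LOCUS_END),        # hypothetical start before the locus start
    "sp23": (395658 + 17, LOCUS_END),  # reads start 17 bp after the locus start
}

results = {}
for name, (cov_start, cov_end) in coverage.items():
    shared_bases = overlap(LOCUS_START, LOCUS_END, cov_start, cov_end)
    # Under the assumed rule, any overlap at all flags the species as present.
    results[name] = (shared_bases, shared_bases > 0)
    print(name, shared_bases, shared_bases > 0)
```

Under that assumed rule all three species would be flagged with a 1 for this locus, even though sp23 covers fewer bases of the interval than sp1 and sp3.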

brantfaircloth commented 4 years ago

I’m not sure what the question is here - it’s expected that outgroups would have shorter loci.

gushiro commented 4 years ago

Sorry if I did not make it clear. For this particular locus, why did the range not start before position 395658? Reads from species 3 cover this locus before this position. The analogous question would be: why does the range for this locus not start around position 395681, given that the reads in species 23 start around this position?

Hope my question is clear now.

thanks in advance

gushiro commented 4 years ago

Does anyone have any solution to this issue? Thanks in advance