faircloth-lab / phyluce

software for UCE (and general) phylogenomics
http://phyluce.readthedocs.org/

using phyluce_assembly_match_contigs_to_probes on phased sequences #266

Closed idanaughton closed 2 years ago

idanaughton commented 2 years ago

I'm attempting to match my phased UCE sequences to my UCE probes using phyluce_assembly_match_contigs_to_probes, and start the phyluce process over with my phased data in order to construct alignments of taxon-specific groups from my phased samples. I get the following error when trying to use phyluce_assembly_match_contigs_to_probes with my phased fasta files and probe set:

Traceback (most recent call last):
  File "/data/home/idanaughton/.conda/envs/phyluce-1.7.1/bin/phyluce_assembly_match_contigs_to_probes", line 421, in <module>
    main()
  File "/data/home/idanaughton/.conda/envs/phyluce-1.7.1/bin/phyluce_assembly_match_contigs_to_probes", line 354, in main
    contig_name = get_contig_name(lz.name1)
  File "/data/home/idanaughton/.conda/envs/phyluce-1.7.1/bin/phyluce_assembly_match_contigs_to_probes", line 279, in get_contig_name
    return match.groups()[0]
AttributeError: 'NoneType' object has no attribute 'groups'

I'm guessing this has to do with how my phased reads are named, which follows this convention: uce-11841_INM640_0 |uce-11841_phased where INM640 is the sample name. I tried adding a config file at ~/.phyluce.conf with the following contents (after reading through other issues below):

[headers]
trinity:comp\d+_c\d+_seq\d+|c\d+_g\d+_i\d+|TR\d+|c\d+_g\d+_i\d+|TRINITY_DN\d+_c\d+_g\d+i\d+
velvet:node\d+
abyss:node\d+
idba:contig-\d+\d+
spades:NODE_\d+length\d+cov\d+.\d+
itero:uce-\d+length\d+cov\d+.\d+
phased:uce-\d+_IN\d+_d+ |uce-\d+_phased

but still get the same error.

Any pointers here would be much appreciated. Thanks much!

brantfaircloth commented 2 years ago

The header should be truncated by biopython at the first space (which comes before the pipe "|" character). Given that, it looks like you will need to update the phased regular expression to something like:

phased:uce-\d+_INM\d+_\d+

which adds the "M" in the sample name. If the "M" can be any letter (e.g. it's different by sample), then something like:

phased:uce-\d+_\w+_\d+

using "\w+" in place of "INM\d+" should match any run of letters, digits, or underscores in that middle position.
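As a quick check before rerunning the program, the candidate patterns can be tried against the truncated header in plain Python. This is only a minimal sketch: `first_group` is a hypothetical stand-in and not the phyluce function, and wrapping the pattern in an anchored capture group is an assumption about how the lookup behaves. The trailing `_\d+` is written with a backslash here so that it matches the digit suffix `_0`.

```python
import re

# Header as given in the issue. Everything from the first space onward is
# dropped, mirroring the truncation described above.
raw_header = "uce-11841_INM640_0 |uce-11841_phased"
contig_id = raw_header.split()[0]


def first_group(pattern, text):
    """Hypothetical stand-in (not phyluce code): return the first regex
    group, or None when the pattern does not match."""
    match = re.search("^({})".format(pattern), text)
    return match.groups()[0] if match else None


candidates = {
    "original": r"uce-\d+_IN\d+_d+",           # pattern from the config file
    "sample-specific": r"uce-\d+_INM\d+_\d+",  # adds the "M"
    "general": r"uce-\d+_\w+_\d+",             # any word characters mid-name
}

results = {name: first_group(p, contig_id) for name, p in candidates.items()}
for name, value in results.items():
    print(name, "->", value)
```

A pattern that yields None here is one that would leave `match` empty, which is the situation the `'NoneType' object has no attribute 'groups'` error points to.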

idanaughton commented 2 years ago

This worked, thank you! Just had to figure out how to truncate with biopython and tweak the expression to: uce-\d+_IN\w+_\d
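For anyone who needs the same truncation step, here is a minimal sketch of cutting each FASTA header at its first whitespace. It uses only the standard library rather than Biopython (with Biopython, a FASTA record's `id` is the first word of the header line). The function name `truncate_fasta_headers`, the second header, and the short sequences are made up for illustration.

```python
import io


def truncate_fasta_headers(lines):
    """Yield FASTA lines with each header cut at its first whitespace."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            parts = line[1:].split(None, 1)
            line = ">" + (parts[0] if parts else "")
        yield line


# Illustrative input only: the sequences are placeholders, not real data.
example = io.StringIO(
    ">uce-11841_INM640_0 |uce-11841_phased\n"
    "ACGT\n"
    ">uce-11841_INM640_1 |uce-11841_phased\n"
    "ACGA\n"
)

truncated = list(truncate_fasta_headers(example))
print("\n".join(truncated))
```

To process a real file, the same generator can be fed an open file handle and its output written to a new FASTA file.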

Thanks again!

brantfaircloth commented 2 years ago

You bet 👍