Problem with phyluce_assembly_match_contigs_to_probes for Trinity and Spades assemblies

charbeez commented 9 months ago

Hi Brant, I'm having trouble running phyluce_assembly_match_contigs_to_probes on assemblies from both SPAdes and Trinity. The headers for each are accepted, and both datasets work when run separately, but each run then generates its own probe.matches.sqlite database. Is there a way to either 1) combine the SPAdes and Trinity datasets so that phyluce reads them together (perhaps by altering the headers of one file type or the other) and produces one sqlite database, or 2) merge the two sqlite databases for downstream processing?

Here are my headers for each fasta type, for reference. Both appear to fit the regex in the /.phyluce/config and are processed without a problem independently, but I'd like to process these data together.

>TRINITY_DN8_c0_g1_i4 len=1747 path=[0:0-199 2:200-385 4:386-417 6:418-425 7:426-445 8:446-491 9:492-655 11:656-681 12:682-841 14:842-938 16:939-1009 17:1010-1083 18:1084-1141 20:1142-1179 21:1180-1513 23:1514-1746]
GTTTGGCGTGCTTTTCATTCTAAACTGTCTGGAGGCGATAAAGCTAAACAGTGAAGAAGTAGCTTGGTCCCAGTCGAGATAGCCTTGACCAATTCACGCACAAGGGGCTACTCACGTTGCTCGGAGCTGCAACGATGGCGGACGAGAGATTTAGCGTCGTAGACTATGTAGTCTTCAG

>NODE_1_length_4195_cov_17.912093
ACTGTATGGGACACCTAGAATGAAGGGGATGGCAAAATGAGAATAAATGGAAGAGAAGAG
AGAGAAAACAAGGGAGAGAAAGGAAGGAATGTAGAAAGAAAGGAAGGAGAGGAGGAAGAA
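As a quick sanity check, the two header styles above can be tested against regular expressions before running phyluce. This is a minimal sketch: the patterns below are written only for this example and are assumptions, not a copy of phyluce's configuration, so compare them with the [headers] section of your own config. `classify_header` is a hypothetical helper.

```python
import re
from typing import Optional

# Illustrative patterns only (assumptions, not phyluce's shipped regexes).
HEADER_PATTERNS = {
    "trinity": re.compile(r"^TRINITY_DN\d+_c\d+_g\d+_i\d+"),
    "spades": re.compile(r"^NODE_\d+_length_\d+_cov_\d+\.\d+"),
}


def classify_header(header: str) -> Optional[str]:
    """Return the assembler whose pattern matches the header, else None."""
    name = header.lstrip(">")
    for assembler, pattern in HEADER_PATTERNS.items():
        if pattern.match(name):
            return assembler
    return None


if __name__ == "__main__":
    for h in (
        ">TRINITY_DN8_c0_g1_i4 len=1747",
        ">NODE_1_length_4195_cov_17.912093",
    ):
        print(h, "->", classify_header(h))
```

Because each header matches only one of the two patterns, a single run configured for one assembler would not be expected to recognize contigs from the other.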

Thanks!

brantfaircloth commented 9 months ago

Howdy,

You can do either. That is, you can modify the headers to look like what is expected for SPAdes or Trinity (but not both), or you can integrate the two databases into one; see the "Incorporating Outgroup Data" sections starting here. You use both databases for this step and the next step, and then you should be good to go.

Basically, you'll treat one or the other data source as "outgroup data".
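For the header-modification route, one possible sketch is to rewrite Trinity headers into SPAdes-style names before matching contigs to probes. Everything here is an assumption made for illustration: the helper names are hypothetical, the coverage value is a placeholder (Trinity headers carry no coverage), and it assumes phyluce needs only the contig name to fit the SPAdes pattern.

```python
import re

TRINITY_LEN = re.compile(r"len=(\d+)")


def trinity_to_spades_style(header: str, index: int, coverage: float = 1.0) -> str:
    """Build a SPAdes-style header from a Trinity header.

    The coverage is a placeholder, not a measured value.
    """
    match = TRINITY_LEN.search(header)
    if match is None:
        raise ValueError(f"no len= field in header: {header!r}")
    return f">NODE_{index}_length_{match.group(1)}_cov_{coverage:.6f}"


def rename_fasta(lines):
    """Yield FASTA lines with Trinity headers renamed; sequences pass through."""
    index = 0
    for line in lines:
        if line.startswith(">"):
            index += 1
            yield trinity_to_spades_style(line, index)
        else:
            yield line.rstrip("\n")


if __name__ == "__main__":
    example = [">TRINITY_DN8_c0_g1_i4 len=1747 path=[0:0-199]\n", "GTTTGGCGTG\n"]
    print("\n".join(rename_fasta(example)))
```

Renaming discards the original Trinity identifiers, so it may be worth keeping the untouched assemblies alongside the renamed copies.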

charbeez commented 9 months ago

Thank you for getting back to me so quickly! I tried the "incorporating outgroup/other data" approach above with:

phyluce_assembly_get_match_counts \
    --locus-db /fs/scratch/PAS1918/CB_UCE_NMNH/Final_tree_MOO+transcriptomes/MOOs+ZOs/uce-search-results/probe.matches.sqlite \
    --taxon-list-config /fs/scratch/PAS1918/CB_UCE_NMNH/Final_tree_MOO+transcriptomes/taxon-list.conf \
    --taxon-group 'dataset' \
    --extend-locus-db /fs/scratch/PAS1918/CB_UCE_NMNH/Final_tree_MOO+transcriptomes/Transcriptomes/uce-search-results/probe.matches.sqlite \
    --output /fs/scratch/PAS1918/CB_UCE_NMNH/Final_tree_MOO+transcriptomes/dataset.conf

and I am now running into another error; I've copied the final few lines below. It looks like this ran successfully using both separate probe.matches.sqlite databases, but every sample failed to detect any UCE loci. Do you have any suggestions on how to proceed?

2023-09-29 11:24:23,630 - phyluce_assembly_get_match_counts - INFO - Failed to detect 1373 UCE loci in MOO_53_16
2023-09-29 11:24:23,630 - phyluce_assembly_get_match_counts - INFO - Failed to detect 1363 UCE loci in MOO_52_14
2023-09-29 11:24:23,630 - phyluce_assembly_get_match_counts - INFO - Failed to detect 1352 UCE loci in MOO_46_14
2023-09-29 11:24:23,632 - phyluce_assembly_get_match_counts - INFO - Writing the taxa and loci in the data matrix to /fs/scratch/PAS1918/CB_UCE_NMNH/Final_tree_MOO+transcriptomes/dataset.conf
2023-09-29 11:24:23,633 - phyluce_assembly_get_match_counts - INFO - ========== Completed phyluce_assembly_get_match_counts ==========
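To see at a glance how many loci were reported missing per taxon, the "Failed to detect" lines can be tallied with a small script. This is only a sketch; `failed_counts` is a hypothetical helper, and the regular expression assumes the message format shown in the log lines above.

```python
import re

# Matches messages such as "Failed to detect 1373 UCE loci in MOO_53_16".
FAILED = re.compile(r"Failed to detect (\d+) UCE loci in (\S+)")


def failed_counts(log_text: str) -> dict:
    """Map each taxon name to the number of loci reported as not detected."""
    return {taxon: int(count) for count, taxon in FAILED.findall(log_text)}


if __name__ == "__main__":
    log = """\
2023-09-29 11:24:23,630 - phyluce_assembly_get_match_counts - INFO - Failed to detect 1373 UCE loci in MOO_53_16
2023-09-29 11:24:23,630 - phyluce_assembly_get_match_counts - INFO - Failed to detect 1363 UCE loci in MOO_52_14
2023-09-29 11:24:23,630 - phyluce_assembly_get_match_counts - INFO - Failed to detect 1352 UCE loci in MOO_46_14
"""
    for taxon, count in sorted(failed_counts(log).items()):
        print(taxon, count)
```

Comparing each count with the total number of loci in the probe set would show whether a taxon truly has no detected loci or only some missing ones.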

My taxon-list.conf file begins with [dataset], and the second set of transcriptome-sourced taxa (those in the --extend-locus-db database) is denoted with an asterisk following each name.
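To double-check which taxa are flagged for the extended database, a taxon list laid out as described above can be split on the trailing asterisk. This is a sketch under assumptions: `split_taxa` is a hypothetical helper, and the two transcriptome names are placeholders, since the real names are not shown here.

```python
def split_taxa(conf_text: str, group: str):
    """Return (base, extended) taxon names listed under [group].

    Names ending in '*' are treated as coming from the extended database.
    """
    base, extended = [], []
    in_group = False
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            in_group = line[1:-1] == group
            continue
        if in_group:
            if line.endswith("*"):
                extended.append(line[:-1])
            else:
                base.append(line)
    return base, extended


if __name__ == "__main__":
    conf = """\
[dataset]
MOO_53_16
MOO_52_14
MOO_46_14
transcriptome_A*
transcriptome_B*
"""
    base, extended = split_taxa(conf, "dataset")
    print("locus-db taxa:", base)
    print("extend-locus-db taxa:", extended)
```

The names in each list can then be compared by eye with the taxon names stored in the corresponding probe.matches.sqlite database.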

Thanks again for your help!

faircloth-lab / phyluce #316