faircloth-lab / phyluce

software for UCE (and general) phylogenomics
http://phyluce.readthedocs.org/
Other
76 stars 48 forks source link

non-UCE probes, slice UCE probes from genome #300

Closed crcardenas closed 1 year ago

crcardenas commented 1 year ago

Hello again,

I am working with non-UCE probes to try and integrate novel datasets into my phylogenetic analyses. I have concatenated two UCE and one anchored hybrid enrichment probe set (Coleoptera 1.1k, Adephaga 2.9k, and anchored hybrid enrichment from 1kite) to search for loci in sequence data generated by all three probes. I have found similarly targeted genomic regions in all of these probe sets, but I've isolated and joined all probes that don't overlap using bedtools to maximize the loci search.

Currently, I am extracting targeted loci from genomes, transcriptomes, and anchored hybrid enrichment (AHE) using the "extract uces from genomes" workflow. Using just UCE probes, the script runs fine.

I am able to use phyluce_probe_run_multiple_lastzs_sqlite with the concatenated probes without any issue. However, when I get to the phyluce_probe_slice_sequence_from_genomes I get an error with the following script:

Script

phyluce_probe_slice_sequence_from_genomes  \
        --lastz franken-ahe2-lastz \
        --conf ahe2.conf \
        --flank 800 \
        --name-pattern "franken_probes.fasta_v_{}.lastz.clean" \
        --output franken-ahe1-fasta

Error

^[[38;21m2023-03-26 04:32:18,997 - phyluce_probe_slice_sequence_from_genomes - INFO - =================== Starting Phyluce: Slice Sequence ====$
^[[38;21m2023-03-26 04:32:18,998 - phyluce_probe_slice_sequence_from_genomes - INFO - ----------------- Working on SRR12339099 genome ---------$
^[[38;21m2023-03-26 04:32:18,998 - phyluce_probe_slice_sequence_from_genomes - INFO - Reading SRR12339099 genome^[[0m
Traceback (most recent call last):
  File "/home/cody/.conda/envs/phyluce-1.7.1/bin/phyluce_probe_slice_sequence_from_genomes", line 400, in <module>
    main()
  File "/home/cody/.conda/envs/phyluce-1.7.1/bin/phyluce_probe_slice_sequence_from_genomes", line 308, in main
    lz, args.contig_orient, regex
  File "/home/cody/.conda/envs/phyluce-1.7.1/bin/phyluce_probe_slice_sequence_from_genomes", line 260, in parse_lastz_file
    uce_name = new_get_probe_name(lz.name2, regex)
  File "/home/cody/.conda/envs/phyluce-1.7.1/bin/phyluce_probe_slice_sequence_from_genomes", line 136, in new_get_probe_name
    return match.groups()[0]
AttributeError: 'NoneType' object has no attribute 'groups'

Looking at the phyluce script, it seems there is either an issue with the AHE header format or the function defined on line 134, new_get_probe_name(header, regex) , or something else but I am unsure. Do I need to add the parameters --probe-prefix or --probe-regex ? if so, what is your recomendation?

My AHE probe headers look like this:

>TC010565-PA_3_21_13_1_1 |desiger:Vasilikopoulos-et-al_2021_AHE,probes-locus:TC010565-PA_3

I've included what information I thought was necessary to run phyluce but to be honest, I'm not sure what information is needed in the probe headers or if this is the key issue. I could not find an expected fasta header format for this step.

With that context, what information is required in the header? Alternatively, what is the minimum header format necessary to run this step?

If you need additional information I am happy to provide it. Cody

crcardenas commented 1 year ago

I thought you might want to more of an example of headers from each dataset, so here there are:


>TC000027-PA_3_4_1_1_1 |desiger:Vasilikopoulos-et-al_2021_AHE,probes-locus:TC000027-PA_3
....
>uce-271564_p1 |design:Adephaga_v1,designer:Gustafson_et_al,probes-locus:uce-271564,probes-probe:1,probes-source:amphizoa,probes-global-chromo:lane5-index12-CTTGTA-3784_CTTGTA_L005_R1_006_(paired)_trimmed_(paired)_contig_300179,probes-global-start:10,probes-global-end:130,probes-local-start:10,probes-local-end:130
...
>uce-506_p1 |design:coleoptera-v1,designer:faircloth,probes-locus:uce-506,probes-probe:1,probes-source:lepdec1,probes-global-chromo:KI578514.1,probes-global-start:409681,probes-global-end:409801,probes-local-start:10,probes-local-end:130
...
brantfaircloth commented 1 year ago

The probe/bait names in the UCE file have a specific format that is expected by the software. When combining probes from a variety of bait sets together, that naming scheme gets messed up, which is what is causing the problems you are running into.

Your best bet, if you'd like to use phyluce, is to rename the baits to a common scheme that phyluce expects while also ensuring that the names remain unique relative to each other (e.g. each locus needs to have a unique name). The expected fasta header is similar to what you will see in the Coleoptera bait set. For e.g.

>uce-500_p1
ACGTG
>uce-500_p2
GTGAC
>uce-500_p3
GGTC
>uce-501_p1
AGGA
>uce-501_p2
GAAG

So, these are for locus 500, baits p1 through p3, and locus 501 baits p1 and p2. If you look into that file, you will see a lot of metadata after the pipe ("|") in the fasta header, but those data are optional.

What you are wanting to do is also a bit out of scope of what the software is meant to handle.

crcardenas commented 1 year ago

Thanks for your quick reply. I have a few more questions.

It sounds like the data after the pipe are not used in these "slicing" steps. Are they used in any additional steps in the standard workflow after slicing loci from "genomes?" (e.g., Tutorial I: UCE Phylogenomics -- Finding UCE loci, Matching Contigs to probes, etc.).

I had assumed they were, so I might need to modify the workflow before to ensure no two probes from the UCE share the same locus number. Because of this can the id before the underscore (e.g., the uce number: uce-###) be alphanumeric or just numeric?

Something like modifying the ID in the header to look like : >uce-TC000027PA3_p4111 |... Otherwise, I can search for unused UCE numbers in the UCE probe sets and use unique ID's and add the old to the header descriptor. That might simplify renaming my headers.

I have one last, somewhat related question to the header information. Since header information is optional, can I add metadata to the header?

I've found orthologous genes UCEs target, but I'd like to make sure this additional information wont break the normal work flow. e.g., uce-22907_p1 | design:Adephaga_v1, ... probes-global-start:10,probes-global-end:130,probes-local-start:10,probes-local-end:130,orthologous_target:ARFGAP3,ortholog_source:ensemble

I understand it is a little out of scope for phyluce, I have a broad phylogenetic tree and the inclusion of these additional data types provide many taxa not available from UCE capture. The AHE taxa have few overlap with UCE loci, but its not insignificant. I wanted to try the reverse, seeing if the AHE probes that don't overlap to try to increase the data available for all taxa I would like to include. Even if I can modify the headers to behave in phyluce, do you recommend against using phyluce to identify loci with my non-traditional probe set?

Thank you for taking the time to respond to my questions, Cody

brantfaircloth commented 1 year ago

Correct - the metadata are discarded (they're just used in the bait file to let you know aspects of the bait design). You definitely want baits to be unique, and the ids should be numeric. It's possible alphanumeric could work, but numeric WILL work.

What I usually do if attempting what you are doing is to make the bait #s jump significantly by set - so for bait set 1, start indexing at 0; for bait set 2, start indexing at 100000;for bait set 3 start indexing at 200000; etc. Of course, that is somewhat dependent on how many baits are in a given set. If you need to, you can move to the millions (e.g. bait set 1 starts at 0, bait set 2 starts at 1000000, etc.). That keeps it pretty easy to increment and also still lets you know which baits came from what set, as long as you write down which #s are which bait set.

You can add or modify your own info after the pipe to track whatever you like (e.g. the original bait #, etc). You'll also need to track which probe goes with which locus - which seems relatively easy across all the sets.

crcardenas commented 1 year ago

If I can't get phyluce to work with these modified probes I'll have to go a different route.

Thanks as always for your advice.

brantfaircloth commented 1 year ago

Right on - did you get it worked out?

crcardenas commented 1 year ago

I haven't had a chance to work on it this week as much as I had hoped.. I've had a lot of lab work.

I'm hoping to get to it next week at some point.

I'll keep you updated.

crcardenas commented 1 year ago

Just following up, I was finally was able to get some work done on this. I needed to revise my "overlap" search to include a blast search along bwa + bedtools. Either way I was able to get the slice step working after cleaning up the probe names.

I mostly got stuck because because some probes were present that I thought should have been filtered. I realized that one dateset included another (Adephaga2.9 & Coleoptera1.1). I couldn't figure out why they persisted after filtering the Coleoptera UCE probes :upside_down_face:

But I've got a 4.4k probeset that should improve my integrative dataset.

Thanks for your advice as always @brantfaircloth !

brantfaircloth commented 1 year ago

You're welcome - glad you got it sorted!