faircloth-lab / phyluce

software for UCE (and general) phylogenomics
http://phyluce.readthedocs.org/

match_contigs_to_probes error/regex related problems #198

Closed AlejandraPanzera closed 3 years ago

AlejandraPanzera commented 3 years ago

Hello!

I am getting an error in the step named in the title. At first I used the wrong regex and it ran, but in the next step I realized something was wrong. I went back to match_contigs_to_probes and corrected the regex to capture my probe names, as follows:

--regex '^>.exons.'

for the following probe format:

long_exons|EOG090700KG|CscutB_PGA_scaffold16_644_contigs_length_47111555:30370957-30370790|2x|1
shortexons_50bpextended|CscutB_PGA_scaffold6_699_contigs_length_94985253:51285269-51285431|end|

My command is this one:

$ python2.7 match_contigs_to_probes.py --contigs Velvet_kmer90/contigs/ --probes Final_RG_9702.fasta --output ./Velvet90_output_REGEX --regex '^>.exons.'

As you can see, it starts running but then I get errors:

2020-08-13 16:02:47,216 - match_contigs_to_probes - INFO - ================ Starting match_contigs_to_probes ===============
2020-08-13 16:02:47,216 - match_contigs_to_probes - INFO - Version: 1.5.0
2020-08-13 16:02:47,216 - match_contigs_to_probes - INFO - Argument --contigs: /Users/alejandrapanzera/Desktop/TRANSFER_GLOBUS/Velvet_kmer90/contigs
2020-08-13 16:02:47,216 - match_contigs_to_probes - INFO - Argument --dupefile: None
2020-08-13 16:02:47,217 - match_contigs_to_probes - INFO - Argument --keep_duplicates: None
2020-08-13 16:02:47,217 - match_contigs_to_probes - INFO - Argument --log_path: None
2020-08-13 16:02:47,217 - match_contigs_to_probes - INFO - Argument --min_coverage: 80
2020-08-13 16:02:47,217 - match_contigs_to_probes - INFO - Argument --min_identity: 80
2020-08-13 16:02:47,217 - match_contigs_to_probes - INFO - Argument --output: /Users/alejandrapanzera/Desktop/TRANSFER_GLOBUS/Velvet90_output_REGEX
2020-08-13 16:02:47,217 - match_contigs_to_probes - INFO - Argument --probes: /Users/alejandrapanzera/Desktop/TRANSFER_GLOBUS/Final_RG_9702.fasta
2020-08-13 16:02:47,217 - match_contigs_to_probes - INFO - Argument --regex: ^>.exons.
2020-08-13 16:02:47,217 - match_contigs_to_probes - INFO - Argument --verbosity: INFO
Traceback (most recent call last):
  File "match_contigs_to_probes.py", line 335, in <module>
    main()
  File "match_contigs_to_probes.py", line 238, in main
    uces = set(new_get_probe_name(seq.id, regex) for seq in SeqIO.parse(open(args.probes, 'rU'), 'fasta'))
  File "match_contigs_to_probes.py", line 238, in <genexpr>
    uces = set(new_get_probe_name(seq.id, regex) for seq in SeqIO.parse(open(args.probes, 'rU'), 'fasta'))
  File "match_contigs_to_probes.py", line 227, in new_get_probe_name
    return match.groups()[0]
AttributeError: 'NoneType' object has no attribute 'groups'

If my regex is correct, what is the problem? It was running with no apparent error with the wrong regex (but of course it recovered 0 unique contigs).

The files in the contigs folder are named like this: Cscu_s124.contigs.fasta

So there are no weird symbols or special characters, as mentioned in other similar issues here on GitHub.

How can I fix this? Any help will be really appreciated!!

brantfaircloth commented 3 years ago

As far as I can tell, you are going to need to rename the baits/probes in that file to something that fits the expected scheme a little better. For example, the software expects the bait/probe names to follow something along the lines of what it says in the full help of the program:

Match UCE probes/baits to assembled contigs and store the data in a relational
database. The matching process is dependent on the probe names in the file. If
the probe names are not like 'uce-1001_p1' where 'uce-' indicates we're
searching for uce loci, '1001' indicates locus 1001, '_p1' indicates this is
probe 1 for locus 1001, you will need to set the optional --regex parameter.
So, if your probe names are 'MyProbe-A_probe1', the --regex will look like
--regex='^(MyProbe-\W+)(?:_probe\d+.*)'
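
As a quick sanity check on the help-text example, the sketch below runs the pattern against the name 'MyProbe-A_probe1' with Python's `re` module. One caveat that is easy to miss: uppercase `\W` matches non-word characters, so the pattern as printed does not match the letter `A`. The lowercase `\w` variant is an assumption about the intended pattern, not something the help text states.

```python
import re

name = "MyProbe-A_probe1"

# Pattern exactly as printed in the help text (\W = NON-word characters).
as_printed = re.compile(r"^(MyProbe-\W+)(?:_probe\d+.*)")

# Assumed intent: lowercase \w (word characters) so that "A" is captured.
lowercase = re.compile(r"^(MyProbe-\w+)(?:_probe\d+.*)")

printed_match = as_printed.search(name)   # no match for this name
lower_match = lowercase.search(name)      # first group is the locus name

print(printed_match)
print(lower_match.group(1))
```

Testing a candidate pattern on a handful of real probe names like this, before running the full program, makes it clear whether the first capture group holds the locus name.
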
AlejandraPanzera commented 3 years ago

Thanks so much for the quick reply!

Yes, I saw that part, but it also says that if the probe names don't match that scheme, you should add a --regex that suits your probe names.

Are you saying that it will still fail even if I do that?

brantfaircloth commented 3 years ago

You need to design the bait names to be something that is unique and trackable; the regular expression is used to parse those names out of the fasta header line. I'm not sure how the names that you shared fit with that scheme.
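
To illustrate the parsing step, here is a minimal sketch modelled on the `match.groups()[0]` call shown in the traceback. The helper `get_probe_name` and the `uce-` pattern are hypothetical stand-ins, not the program's actual source. The sketch shows two things: the locus name comes from the first capture group, and `re.search` returns `None` when the pattern does not match, which is what produces `'NoneType' object has no attribute 'groups'`. Note also that Biopython's `seq.id` does not include the leading `>` of the header line, so a pattern anchored with `^>` cannot match it. Whether that is the only reason the match failed in this thread is an assumption.

```python
import re


def get_probe_name(header, pattern):
    """Return the locus name (first capture group) parsed from a probe header.

    Hypothetical helper: raises a readable error instead of the bare
    AttributeError seen when re.search() returns None.
    """
    match = re.search(pattern, header)
    if match is None:
        raise ValueError("{!r} does not match {!r}".format(pattern, header))
    if not match.groups():
        raise ValueError("{!r} has no capture group".format(pattern))
    return match.group(1)


# A name in the 'uce-1001_p1' style; the pattern is an assumed example.
locus = get_probe_name("uce-1001_p1", r"^(uce-\d+)(?:_p\d+.*)")

# One of the headers from this thread, as seq.id would hold it (no leading '>').
header = ("long_exons|EOG090700KG|CscutB_PGA_scaffold16_644_contigs_"
          "length_47111555:30370957-30370790|2x|1")
try:
    get_probe_name(header, r"^>.exons.")
    failed = False
except ValueError as err:
    failed = True
    print(err)
```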

The names in your file of baits need to be something like:

>long-1_p1
>long-1_p2
>long-1_p3
>long-2_p1
>long-2_p2
>short-1_p1
>short-2_p2

And then your regular expression would be something like '^((?:long|short)-\d+)(?:_p\d+.*)'.
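
A candidate pattern for names in this `long-1_p1` style can be checked before running the program. The sketch below groups probe names by locus; the pattern `^((?:long|short)-\d+)(?:_p\d+.*)` is an assumption adapted to the `_p1` suffix of these example names, and the malformed name is a made-up case to show what a non-matching header looks like.

```python
import re
from collections import defaultdict

# Assumed pattern: capture "long-<n>" or "short-<n>", then require "_p<n>".
pattern = re.compile(r"^((?:long|short)-\d+)(?:_p\d+.*)")

names = [
    "long-1_p1", "long-1_p2", "long-1_p3",
    "long-2_p1", "long-2_p2",
    "short-1_p1",
    "short1_p2",  # hypothetical malformed name (missing hyphen)
]

probes_by_locus = defaultdict(list)
unmatched = []
for name in names:
    match = pattern.search(name)
    if match is None:
        # Names like this would trigger the 'NoneType' error in a real run.
        unmatched.append(name)
    else:
        probes_by_locus[match.group(1)].append(name)

print(sorted(probes_by_locus))
print(unmatched)
```

If `unmatched` is empty for the whole probe file, the same pattern string can be passed to `--regex`.
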

AlejandraPanzera commented 3 years ago

Thank you! I'll replace the probe names and try that.