Closed crcardenas closed 2 years ago
These are limitations of Gblocks, as far as I can tell. And, because the Gblocks source code is not open, we cannot change it to fix them. That said, there are some workarounds.
First, you can strip the uce names from all of your loci (phyluce_align_remove_locus_name_from_files) - that will remove all the locus name characters from each line. To remove the spades bit you could use sed or similar. You can also filter your loci to ensure that the final set you are trimming contains the correct number of taxa using phyluce_align_get_only_loci_with_min_taxa.
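For the sed step, here is a minimal sketch. It assumes the assembler tag appears as a literal "_spades" once per header line; the function name strip_spades and the taxon name in the demo are made up for illustration.

```shell
# Sketch: drop a literal "_spades" from FASTA header lines only.
# Assumption: "_spades" occurs at most once per header.
# Sequence lines (those not starting with ">") are left untouched.
strip_spades() {
    sed '/^>/ s/_spades//'
}

# Demo on a made-up record (hypothetical taxon name):
printf '>Genus_species_spades\nACGT\n' | strip_spades
```

To change real files rather than a stream, the same expression can be run per file on a copy of the alignment directory, so the originals stay available if the pattern turns out to match more than intended.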
If you continue to have issues with taxon name length, you could (1) use another trimming algorithm like trimAl (also implemented in phyluce), which I think will take long names, or (2) shorten your taxon names using phyluce_align_convert_one_align_to_another, which has an automated --shorten-names parameter, then run the shortened files back through Gblocks.
That's rather unfortunate that Gblocks behaves that way. Thanks for the advice though!
However, I used:
phyluce_align_remove_locus_name_from_files \
--alignments mafft-nexus-internal-trimmed/ \
--output testout/ \
--input-format fasta \
--output-format fasta \
--cores 4
but the files got converted in an unexpected way:
>Agabetes_acuductus_SRR10334071 uce-235988_Agabetes_acuductus_SRR10334071 |uce-235988
....
>Adalia_bipunctata_GCA_910592335 uce-99_Adalia_bipunctata_GCA_910592335 |uce-99
I solved it with a pretty straightforward sed command though.
for i in ./testout/*.fasta; do sed -i '/^>/ s/ .*//' "$i"; done
I then ran the code I had an issue with and this was solved.
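For anyone reading later, here is a small sketch of what that sed expression does, run on one of the headers quoted above instead of on files. The function name keep_first_token is only for illustration.

```shell
# Sketch: on header lines, delete everything from the first space onward,
# so only the first whitespace-delimited token of the header remains.
keep_first_token() {
    sed '/^>/ s/ .*//'
}

# Demo using a header copied from the output shown in this thread:
printf '%s\nACGT\n' \
    '>Agabetes_acuductus_SRR10334071 uce-235988_Agabetes_acuductus_SRR10334071 |uce-235988' \
    | keep_first_token
```

The demo prints the header reduced to >Agabetes_acuductus_SRR10334071, followed by the unchanged sequence line.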
Glad you got it working.
I have had some issues with names previously, while trying to concatenate multiple datasets to find UCE loci, and had mostly overcome them. However, I am having an issue with phyluce_align_get_gblocks_trimmed_alignments_from_untrimmed.
Here is my command, pretty standard:
It runs until it gets to a particular uce (uce-223111.fasta)... (I abbreviated the dots here).
I first tried building a fresh environment (with --verbosity CRITICAL), but I get the same error:
Next, I tried running gblocks on the fasta causing the issue and another random one.
and it was successful with the other fasta file.
My next strategy was to move the problematic fasta out of the directory it lives in, but the run then found a new problematic file. Copying the file to test the run, you can see part of the aforementioned naming issues in the names of the first 20 sequences:
uce-22311.fasta
uce-138187.fasta
It seems there is a limit to the length of names, and Gblocks fails to produce the *fasta-gb file.
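Rather than finding the offending files one at a time, a quick scan of header lengths could flag them all at once. This is only a sketch: the function name list_long_names is made up, and the threshold is a parameter because the actual limit Gblocks enforces is not established in this thread.

```shell
# Sketch: report FASTA headers longer than a chosen number of characters.
# Usage: list_long_names MAX_LEN FILE...
# Prints: file <TAB> header length <TAB> header (without the leading ">").
# Assumption: the whole header line counts as the name.
list_long_names() {
    max_len=$1
    shift
    awk -v max="$max_len" '
        /^>/ {
            name = substr($0, 2)
            if (length(name) > max)
                printf "%s\t%d\t%s\n", FILENAME, length(name), name
        }
    ' "$@"
}

# Demo on a throwaway file with made-up names:
tmp=$(mktemp)
printf '>short\nACGT\n>a_much_longer_taxon_name_spades\nACGT\n' > "$tmp"
list_long_names 10 "$tmp"
rm -f "$tmp"
```

Run against the alignment directory with a few different thresholds, this should show whether the files that fail are the ones with the longest names.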
Either way, I started to remove more fastas that caused this issue, and kept running into more of the same:
uce-155796.fasta
I can keep going, but that will be an exercise in futility and a waste of time as I keep finding more.
I'd appreciate any help!
While I wait I am going to try renaming the headers (removing "_spades" at least).
Two things to note that may or may not be relevant. phyluce_align_seqcap_align did not have any issues with the file names and ran fine with this command:
I will note that using "gblocks" doesn't call the program; only "Gblocks" does, even though it's listed in my environment as gblocks (both environments are identical).
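On the gblocks versus Gblocks point: a package name and the executable it installs do not have to match, and executable lookup is case-sensitive on most Linux filesystems. A minimal sketch for checking which spelling resolves on PATH (the function name report_command is made up for illustration):

```shell
# Sketch: show which of the given command names resolve on PATH.
# Uses the POSIX "command -v" builtin; prints "not found" otherwise.
report_command() {
    for name in "$@"; do
        if resolved=$(command -v "$name" 2>/dev/null); then
            printf '%s -> %s\n' "$name" "$resolved"
        else
            printf '%s -> not found\n' "$name"
        fi
    done
}

# Check both spellings inside the activated environment:
report_command gblocks Gblocks
```

The output depends on the environment, so compare it between the two environments rather than expecting a particular path.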