FePhyFoFum / phyx

phylogenetics tools for linux (and other mostly posix compliant) computers
blackrim.org
GNU General Public License v3.0

pxrms outputting copy of original #97

Closed JimStarrett closed 5 years ago

JimStarrett commented 5 years ago

Hello again

I am running pxrms, but the outfile I am getting is an exact copy of the infile, so no individuals are being pruned out. I would actually like to get the complement, but when I run it with the -c flag it returns an empty file. Not sure what is going on, but here is my command and an example of my name list. I also tried a comma-separated list, but with no success.

pxrms -f name.txt -s T396_L1.fasta -o out5.fa

The format of my name list is like so:

I21560_AUMS19499_Araneae_Lycosidae_Schizocosa_ocreata_seq1
I21567_AUMS19519_Araneae_Lycosidae_Schizocosa_uetzi_rovneri_seq1
I21570_AUMS19434_Araneae_Lycosidae_Schizocosa_ocreata_seq1
I21563_AUMS19507_Araneae_Lycosidae_Schizocosa_retrorsa_seq1

josephwb commented 5 years ago

Is it possible that you have Windows (or "classic mac") line endings? That could account for things. Because it works with the following.

"test.fa":

>TaxonA
AAATTTCCCTGTCCCTTTAA
>TaxonB
GCTCGAGGGGCCCCAAGACC
>TaxonC
ACGCTCCCCCTTAAAAATGA
>TaxonD
TCCTTGTTCAACTCCGGTGG
>TaxonE
TTACTATTCCCCCCCGCCGG
>I21560_AUMS19499_Araneae_Lycosidae_Schizocosa_ocreata_seq1
AAATTTCCCTGTCCCTTTAA
>I21567_AUMS19519_Araneae_Lycosidae_Schizocosa_uetzi_rovneri_seq1
AAATTTCCCTGTCCCTTTAA
>I21570_AUMS19434_Araneae_Lycosidae_Schizocosa_ocreata_seq1
AAATTTCCCTGTCCCTTTAA
>I21563_AUMS19507_Araneae_Lycosidae_Schizocosa_retrorsa_seq1
AAATTTCCCTGTCCCTTTAA

"name.txt":

I21560_AUMS19499_Araneae_Lycosidae_Schizocosa_ocreata_seq1
I21567_AUMS19519_Araneae_Lycosidae_Schizocosa_uetzi_rovneri_seq1
I21570_AUMS19434_Araneae_Lycosidae_Schizocosa_ocreata_seq1
I21563_AUMS19507_Araneae_Lycosidae_Schizocosa_retrorsa_seq1

Now, run:

$ pxrms -s test.fa -f name.txt 
>TaxonA
AAATTTCCCTGTCCCTTTAA
>TaxonB
GCTCGAGGGGCCCCAAGACC
>TaxonC
ACGCTCCCCCTTAAAAATGA
>TaxonD
TCCTTGTTCAACTCCGGTGG
>TaxonE
TTACTATTCCCCCCCGCCGG

josephwb commented 5 years ago

And when I run it with -c I get:

$ pxrms -f name.txt -s test.fa -c
>I21560_AUMS19499_Araneae_Lycosidae_Schizocosa_ocreata_seq1
AAATTTCCCTGTCCCTTTAA
>I21567_AUMS19519_Araneae_Lycosidae_Schizocosa_uetzi_rovneri_seq1
AAATTTCCCTGTCCCTTTAA
>I21570_AUMS19434_Araneae_Lycosidae_Schizocosa_ocreata_seq1
AAATTTCCCTGTCCCTTTAA
>I21563_AUMS19507_Araneae_Lycosidae_Schizocosa_retrorsa_seq1
AAATTTCCCTGTCCCTTTAA

Maybe copy-paste these examples and make sure they work for you?
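One way to test the line-ending hypothesis before running pxrms is to count carriage returns and line feeds in the name list. The sketch below assumes only the standard `tr` and `wc` tools; the function name `line_endings` and its three labels are made up for illustration and are not part of phyx.

```shell
# Guess the line-ending style of a text file by counting CR and LF bytes.
# Hypothetical helper, not part of phyx.
line_endings() {
    local file="$1" cr lf
    # $(( ... )) strips any padding that some wc implementations add.
    cr=$(( $(tr -cd '\r' < "$file" | wc -c) ))
    lf=$(( $(tr -cd '\n' < "$file" | wc -c) ))
    if [ "$cr" -eq 0 ]; then
        echo "unix"   # no carriage returns at all
    elif [ "$lf" -eq 0 ]; then
        echo "mac"    # carriage returns only ("classic mac")
    else
        echo "dos"    # both present, most likely Windows CRLF
    fi
}
```

Running `line_endings name.txt` and getting `mac` or `dos` would suggest converting the file before passing it to pxrms.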

JimStarrett commented 5 years ago

Yep, that sort of seems to have been the issue. I was able to replicate your example. My name.txt file did have Unix line endings, but when I changed the file type of my name.txt to Text File in TextWrangler it worked properly. Thanks for your help!

josephwb commented 5 years ago

Glad we could figure this (one) out! If you are dealing with Windows files, you can do:

dos2unix FILES

If "classic mac" (which is more likely if you are using TextWrangler), do:

dos2unix -c mac FILES

Because I come across this every so often, I have the following alias in my .bashrc:

alias mac2unix='dos2unix -c mac'

so then if I come across a mac file I can just:

mac2unix FILE

HTH. And I wonder if this is involved in your other issue (#95)?
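If dos2unix is not installed, the same two conversions can be approximated with plain `tr`. This is only a sketch: the function names `crlf_to_lf` and `cr_to_lf` are hypothetical, each writes to a separate output file rather than converting in place, and neither checks which style the input actually uses.

```shell
# Windows (CRLF) -> Unix (LF): delete every carriage return.
crlf_to_lf() {
    tr -d '\r' < "$1" > "$2"
}

# "Classic mac" (CR) -> Unix (LF): turn each carriage return into a line feed.
cr_to_lf() {
    tr '\r' '\n' < "$1" > "$2"
}
```

For example, `cr_to_lf name.txt name_unix.txt` would write a converted copy and leave the original untouched.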

JimStarrett commented 5 years ago

Thank you for those conversion commands! I'll take a look at my tree files to see if that is the issue for #95.

Now that I have the pxrms command working for one file I am trying to implement this in a shell script with a loop so I can do this for about 500 alignments. Do you have any suggestions for a 'for loop'?

For example I have align_1.fasta align_2.fasta align_3.fasta etc.

and want align_1_reduced_taxa.fasta align_2_reduced_taxa.fasta align_3_reduced_taxa.fasta until the last file.

josephwb commented 5 years ago

If you want to process all of the fasta files in a directory (i.e. remove the same set of taxa from all) you can do:

for x in *.fasta; do pxrms -f name.txt -s "$x" -c -o "${x}_reduced.fa"; done

This will generate files of the name align_*.fasta_reduced.fa. There are cleverer ways to get the exact output name you specify above, but I would have to think about it.

josephwb commented 5 years ago

Ok, you can do this:

for x in *.fasta; do pxrms -f name.txt -s "$x" -c -o "${x%.*}_reduced.fasta"; done
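For the exact names asked for above (align_1_reduced_taxa.fasta and so on), the loop could be wrapped up roughly as follows. This is a sketch rather than a tested recipe: `reduced_name` and `reduce_all` are hypothetical helper names, the pxrms flags are only the ones already used in this thread, and the skip rule is there because the outputs also end in .fasta and would otherwise be picked up on a second run.

```shell
# Derive the output name: align_1.fasta -> align_1_reduced_taxa.fasta
reduced_name() {
    printf '%s_reduced_taxa.fasta\n' "${1%.fasta}"
}

# Run pxrms on every *.fasta file in a directory (default: current one),
# skipping files produced by an earlier run.
reduce_all() {
    local names="$1" dir="${2:-.}" x
    for x in "$dir"/*.fasta; do
        [ -e "$x" ] || continue          # the glob matched nothing
        case "$x" in
            *_reduced_taxa.fasta) continue ;;
        esac
        pxrms -f "$names" -s "$x" -c -o "$(reduced_name "$x")"
    done
}
```

Calling `reduce_all name.txt` from the directory holding the alignments would then process each one in turn.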