Trinotate / Trinotate.github.io

web documentation for Trinotate

integrate sigP 5.0 #31

Open brianjohnhaas opened 4 years ago

brianjohnhaas commented 4 years ago

do it

LuciaPita commented 4 years ago

Dear Brian, dear Trinotate users,

I am using sigP 5.0 with Trinotate v3.2.0. Here are some issues I have encountered, in case they help others. I am not a programmer, so there are probably more elegant ways to solve them:

First, I needed to shorten the protein headers of the protein fasta file (transdecoder output; e.g., protein.fasta) to have only the protein ID. Otherwise, there were errors due to invalid characters:

awk -F " " '/^>/ {print $1; next} 1' protein.fasta > sig_v5.input.protein.fasta
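As a rough sketch of the same step with a sanity check added: the functions below wrap the awk command above and then count shortened IDs that occur more than once, since truncating headers could in principle produce duplicates. The function names `shorten_headers` and `count_duplicate_ids` are hypothetical helpers introduced here for illustration; they are not part of Trinotate, TransDecoder, or SignalP.

```shell
# Sketch only: helper names are hypothetical, not part of any tool.

# Keep only the first whitespace-delimited token of each header line
# (e.g. ">id other text" becomes ">id"); sequence lines pass through unchanged.
shorten_headers() {
    # $1 = input FASTA; the shortened FASTA is written to stdout
    awk '/^>/ {print $1; next} {print}' "$1"
}

# Print how many distinct header lines occur more than once.
count_duplicate_ids() {
    # $1 = FASTA file to check
    grep '^>' "$1" | sort | uniq -d | wc -l | tr -d ' '
}

# Example usage (file names taken from the comment above):
#   shorten_headers protein.fasta > sig_v5.input.protein.fasta
#   count_duplicate_ids sig_v5.input.protein.fasta   # 0 means all IDs are unique
```

If the duplicate count is not 0, the shortened IDs are ambiguous and results could not be mapped back to proteins reliably.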

Then, I was not able to analyze the whole transcriptome at once, despite trying with different values in the new --batch parameter. The solution was to split the transcriptome with the fasta-splitter script developed by Kirill Kryukov http://kirill-kryukov.com/study/tools/fasta-splitter/. A division into files of 100000 sequences worked for me:

perl fasta-splitter.pl sig_v5.input.protein.fasta --part-size 100000 --measure count

That value, 100000, was the one I used in the --batch parameter for signalP. The results can be easily concatenated later.
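The split, run, and concatenate workflow described above could be sketched roughly as follows. Several things here are assumptions rather than facts from this thread: the signalp flags (`-fasta`, `-format short`, `-batch`, `-prefix`) follow my understanding of the SignalP 5.0 command line and should be checked against `signalp -h`; the part file names and the `_summary.signalp5` output suffix are also assumed; and the function names are hypothetical. The merge step assumes each summary file begins with comment lines starting with `#`, which are kept from the first file only.

```shell
# Sketch only: flags, file names, and output suffix are assumptions (see above).

# Run SignalP on each split FASTA part, one file per argument.
run_signalp_on_parts() {
    for part in "$@"; do
        # -batch matches the part size used with fasta-splitter (100000)
        signalp -fasta "$part" -format short -batch 100000 \
                -prefix "${part%.fasta}"
    done
}

# Concatenate per-part summary files: '#' comment lines are kept from the
# first file only, data lines are kept from every file.
merge_summaries() {
    awk 'FNR == 1 { file_index++ }
         /^#/ && file_index > 1 { next }
         { print }' "$@"
}

# Example usage (file names are hypothetical):
#   run_signalp_on_parts sig_v5.input.protein.part-*.fasta
#   merge_summaries sig_v5.input.protein.part-*_summary.signalp5 > signalp.out
```

Dropping the repeated comment lines keeps the merged file looking like a single SignalP run; a plain `cat` would also work if downstream tools ignore `#` lines.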

brianjohnhaas commented 4 years ago

terrific! thanks for contributing this!

--

Brian J. Haas The Broad Institute http://broadinstitute.org/~bhaas http://broad.mit.edu/~bhaas