AlexGa / Phylostratigraphy

Pipeline for Phylostratigraphy
Apache License 2.0

There is only one line of result in the output file called "phyloBlastDB.fa_final_ps_map.csv" #4

Closed TaoCheng98 closed 2 years ago

TaoCheng98 commented 2 years ago

Hi Alexander Gabel,

It is a great honor to use the pipeline for phylostratigraphy that you shared.

However, I have run into a problem while using it.

No errors were reported during the run, but my output file contains only one line of results.

I started with the proteome of Mycobacterium tuberculosis, the organism I am working on, and ran into the problem described above.

Then I used "Acaryochloris marina MBIC11017" as an example, but the same problem occurred.

It seems that only the last processed protein is recorded in the output file.

These are example headers from my organism FASTA files:

NP_214515.1 | [Mycobacterium tuberculosis H37Rv] | [Bacteria; Actinobacteria; Actinomycetia; Corynebacteriales; Mycobacteriaceae; Mycobacterium; Mycobacterium tuberculosis]

WP_009556083.1 | [Acaryochloris marina] | [Bacteria; Cyanobacteria; Oscillatoriophycideae; Chroococcales; Acaryochloris]

This is the command:

perl createPSmap.pl --organism /home/data/t010208/Chengtao/Phylostratigraphic_analysis/rowdata/Acaryochloris_marina_MBIC11017.fasta --database /home/data/t010208/Chengtao/Phylostratigraphic_analysis/phyloBlastDB/phyloBlastDB.fa --prefix phyloBlastDB.fa --seqOffset 50 --evalue 1e-5 --threads 96 --blastPlus

This is the output file:

PS;GeneID
1;NP_214523.1

There is just one line of results, and there doesn't seem to be anything special about this protein, except that it is the last protein in my FASTA file.

The script files are all up to date; the last modification date is 27 Jan 2021.
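One way to confirm which queries are missing is to compare the sequence IDs in the organism FASTA file with the `GeneID` column of the PS map. The following is only a minimal sketch in Python, not part of the pipeline: it assumes the map is semicolon-separated with a `PS;GeneID` header (as in the output shown above) and that the gene ID is the first whitespace-delimited token of each FASTA header. The function names and the toy records are hypothetical.

```python
import csv
import io


def fasta_ids(handle):
    """Return the first whitespace-delimited token of every FASTA header."""
    ids = []
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.append(fields[0])
    return ids


def ps_map_ids(handle):
    """Return the GeneID column of a semicolon-separated PS map."""
    reader = csv.DictReader(handle, delimiter=";")
    return [row["GeneID"] for row in reader]


def missing_from_map(fasta_handle, map_handle):
    """List IDs that appear in the FASTA file but not in the PS map."""
    mapped = set(ps_map_ids(map_handle))
    return [seq_id for seq_id in fasta_ids(fasta_handle) if seq_id not in mapped]


# Toy in-memory data (sequences and lineage are placeholders);
# on real data, pass open("...") handles for the two files instead.
example_fasta = io.StringIO(
    ">NP_214515.1 | [Mycobacterium tuberculosis H37Rv] | [Bacteria]\nMPSP\n"
    ">NP_214523.1 | [Mycobacterium tuberculosis H37Rv] | [Bacteria]\nMKLV\n"
)
example_map = io.StringIO("PS;GeneID\n1;NP_214523.1\n")

missing = missing_from_map(example_fasta, example_map)
print(missing)
```

If nearly every ID is reported as missing, the loss most likely happens before the final map is written, for example at the BLAST step.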

Your reply is greatly appreciated!

Kind regards,

Cheng Tao

AlexGa commented 2 years ago

Hi Cheng Tao,

Thank you for your message and for using the phylostratigraphy script. Based on your description, I cannot reproduce your error.

Have you looked at the XML files (*.xml.tbz) that BLAST generates? And have you extended the header information of each sequence in your FASTA files?
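A quick way to check the header format is to parse each header line and flag those that do not fit. The sketch below is in Python and assumes the three-field layout seen in the headers quoted in this thread (`ID | [organism] | [rank; rank; ...]`). Whether the pipeline requires exactly this layout is an assumption, and `parse_header` is a hypothetical helper, not a function of the pipeline.

```python
import re

# Assumed layout, taken from the headers quoted in this thread:
#   ID | [organism name] | [rank1; rank2; ...]
HEADER_RE = re.compile(r"^(\S+) \| \[([^\]]+)\] \| \[([^\]]+)\]$")


def parse_header(header):
    """Split an extended header into (id, organism, lineage), or return None."""
    match = HEADER_RE.match(header.lstrip(">").strip())
    if match is None:
        return None
    seq_id, organism, lineage = match.groups()
    # Accept both "A; B" and "A;B" as lineage separators.
    return seq_id, organism, [rank.strip() for rank in lineage.split(";")]


header = (
    "WP_009556083.1 | [Acaryochloris marina] | "
    "[Bacteria; Cyanobacteria; Oscillatoriophycideae; Chroococcales; Acaryochloris]"
)
parsed = parse_header(header)
print(parsed)

# A header without the extended fields is reported as None.
print(parse_header("WP_009556083.1 hypothetical protein"))
```

Running this over every line that starts with `>` in both the organism file and the database file would show whether any header deviates from the assumed layout.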

Best

Alex

TaoCheng98 commented 2 years ago

Hi Alexander Gabel,

Thank you for your reply.

I checked the files you mentioned.

There are 80 XML files (*.xml.tbz) in my folder. Is that right?

By the way, I noticed that the file "phyloBlastDB.fa_BLAST_PS_tables.tbz" is only 33 KB.

It seems to contain information about only one protein, YP_178027.

That protein is also the only line in the output file "1_phyloBlastDB.fa_final_ps_map.csv".

Best,

Chengtao

TaoCheng98 commented 2 years ago

Hi Alexander Gabel,

The following is part of one of the XML files you mentioned; I hope it helps.

<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
<BlastOutput>
  <BlastOutput_program>blastp</BlastOutput_program>
  <BlastOutput_version>BLASTP 2.12.0+</BlastOutput_version>
  <BlastOutput_reference>Stephen F. Altschul, Thomas L. Madden, Alejandro A. Sch&amp;auml;ffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), &quot;Gapped BLAST and PSI-BLAST: a new generation of protein database search programs&quot;, Nucleic Acids Res. 25:3389-3402.</BlastOutput_reference>
  <BlastOutput_db>/home/data/t010208/Chengtao/Phylostratigraphic_analysis/phyloBlastDB/phyloBlastDB.fa</BlastOutput_db>
  <BlastOutput_query-ID>Query_1</BlastOutput_query-ID>
  <BlastOutput_query-def>YP_178027.1 | [Mycobacterium tuberculosis H3933Rv] | [Bacteria;Actinobacteria;Actinomycetia;Corynebacteriales;Mycobacteriaceae;Mycobacterium;Mycobacterium tuberculosis]</BlastOutput_query-def>
  <BlastOutput_query-len>406</BlastOutput_query-len>
  <BlastOutput_param>
    <Parameters>
      <Parameters_matrix>BLOSUM62</Parameters_matrix>
      <Parameters_expect>0.001</Parameters_expect>
      <Parameters_gap-open>11</Parameters_gap-open>
      <Parameters_gap-extend>1</Parameters_gap-extend>
      <Parameters_filter>F</Parameters_filter>
    </Parameters>
  </BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
  <Iteration_iter-num>1</Iteration_iter-num>
  <Iteration_query-ID>Query_1</Iteration_query-ID>
  <Iteration_query-def>YP_178027.1 | [Mycobacterium tuberculosis H3933Rv] | [Bacteria;Actinobacteria;Actinomycetia;Corynebacteriales;Mycobacteriaceae;Mycobacterium;Mycobacterium tuberculosis]</Iteration_query-def>
  <Iteration_query-len>406</Iteration_query-len>
<Iteration_hits>
<Hit>
  <Hit_num>1</Hit_num>
  <Hit_id>gnl|BL_ORD_ID|5243865</Hit_id>
  <Hit_def>YP_007353723.1 | [Mycobacterium tuberculosis 7199-99] | [Bacteria; Actinobacteria; Actinobacteria; Actinobacteridae; Actinomycetales; Corynebacterineae; Mycobacteriaceae; Mycobacterium; Mycobacterium tuberculosis complex; Mycobacterium tuberculosis]</Hit_def>
  <Hit_accession>5243865</Hit_accession>
  <Hit_len>406</Hit_len>
  <Hit_hsps>
    <Hsp>
      <Hsp_num>1</Hsp_num>
      <Hsp_bit-score>821.617</Hsp_bit-score>
      <Hsp_score>2121</Hsp_score>
      <Hsp_evalue>0</Hsp_evalue>
      <Hsp_query-from>1</Hsp_query-from>
      <Hsp_query-to>406</Hsp_query-to>
      <Hsp_hit-from>1</Hsp_hit-from>
      <Hsp_hit-to>406</Hsp_hit-to>
      <Hsp_query-frame>0</Hsp_query-frame>
      <Hsp_hit-frame>0</Hsp_hit-frame>
      <Hsp_identity>406</Hsp_identity>
      <Hsp_positive>406</Hsp_positive>
      <Hsp_gaps>0</Hsp_gaps>
      <Hsp_align-len>406</Hsp_align-len>
      <Hsp_qseq>MPSPRREDGDALRCGDRSAAVTEIRAALTALGMLDHQEEDLTTGRNVALELFDAQLDQAVRAFQQHRGLLVDGIVGEATYRALKEASYRLGARTLYHQFGAPLYGDDVATLQARLQDLGFYTGLVDGHFGLQTHNALMSYQREYGLAADGICGPETLRSLYFLSSRVSGGSPHAIREEELVRSSGPKLSGKRIIIDPGRGGVDHGLIAQGPAGPISEADLLWDLASRLEGRMAAIGMETHLSRPTNRSPSDAERAATANAVGADLMISLRCETQTSLAANGVASFHFGNSHGSVSTIGRNLADFIQREVVARTGLRDCRVHGRTWDLLRLTRMPTVQVDIGYITNPHDRGMLVSTQTRDAIAEGILAAVKRLYLLGKNDRPTGTFTFAELLAHELSVERAGRLGGS</Hsp_qseq>
      <Hsp_hseq>MPSPRREDGDALRCGDRSAAVTEIRAALTALGMLDHQEEDLTTGRNVALELFDAQLDQAVRAFQQHRGLLVDGIVGEATYRALKEASYRLGARTLYHQFGAPLYGDDVATLQARLQDLGFYTGLVDGHFGLQTHNALMSYQREYGLAADGICGPETLRSLYFLSSRVSGGSPHAIREEELVRSSGPKLSGKRIIIDPGRGGVDHGLIAQGPAGPISEADLLWDLASRLEGRMAAIGMETHLSRPTNRSPSDAERAATANAVGADLMISLRCETQTSLAANGVASFHFGNSHGSVSTIGRNLADFIQREVVARTGLRDCRVHGRTWDLLRLTRMPTVQVDIGYITNPHDRGMLVSTQTRDAIAEGILAAVKRLYLLGKNDRPTGTFTFAELLAHELSVERAGRLGGS</Hsp_hseq>
      <Hsp_midline>MPSPRREDGDALRCGDRSAAVTEIRAALTALGMLDHQEEDLTTGRNVALELFDAQLDQAVRAFQQHRGLLVDGIVGEATYRALKEASYRLGARTLYHQFGAPLYGDDVATLQARLQDLGFYTGLVDGHFGLQTHNALMSYQREYGLAADGICGPETLRSLYFLSSRVSGGSPHAIREEELVRSSGPKLSGKRIIIDPGRGGVDHGLIAQGPAGPISEADLLWDLASRLEGRMAAIGMETHLSRPTNRSPSDAERAATANAVGADLMISLRCETQTSLAANGVASFHFGNSHGSVSTIGRNLADFIQREVVARTGLRDCRVHGRTWDLLRLTRMPTVQVDIGYITNPHDRGMLVSTQTRDAIAEGILAAVKRLYLLGKNDRPTGTFTFAELLAHELSVERAGRLGGS</Hsp_midline>
    </Hsp>
  </Hit_hsps>
</Hit>
<Hit>
  <Hit_num>2</Hit_num>
  <Hit_id>gnl|BL_ORD_ID|5239873</Hit_id>
  <Hit_def>YP_005924984.1 | [Mycobacterium tuberculosis RGTB423] | [Bacteria; Actinobacteria; Actinobacteria; Actinobacteridae; Actinomycetales; Corynebacterineae; Mycobacteriaceae; Mycobacterium; Mycobacterium tuberculosis complex; Mycobacterium tuberculosis]</Hit_def>
  <Hit_accession>5239873</Hit_accession>

Thank you for your attention to this matter.

Chengtao

TaoCheng98 commented 2 years ago

Hi Alexander Gabel,

I have checked the XML-files as you suggested.

I found that the file contains BLAST information for only one protein.

So the problem could be with BLAST.

By the way, my BLAST+ was installed via conda.
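To see how many queries a BLAST XML report really contains, the `<Iteration>` elements can be counted, since each one carries its own `Iteration_query-def`. The following Python sketch uses only the standard library. It assumes the report has already been unpacked from its `.xml.tbz` archive (presumably a bzip2-compressed tar) and follows the NCBI BlastOutput layout shown in the excerpt above. `query_defs` is a hypothetical helper name.

```python
import io
import xml.etree.ElementTree as ET


def query_defs(xml_source):
    """Yield the query definition of every <Iteration> in a BLAST XML report."""
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "Iteration":
            yield elem.findtext("Iteration_query-def", default="")
            elem.clear()  # drop the processed subtree to keep memory use low


# Minimal in-memory report with a single query; on real data,
# pass the path of an unpacked XML file instead.
example = io.BytesIO(b"""<?xml version="1.0"?>
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_query-def>YP_178027.1</Iteration_query-def>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
""")

queries = list(query_defs(example))
print(len(queries), queries)
```

Comparing this count, summed over all report files, with the number of sequences in the organism FASTA file would indicate whether queries are lost at the BLAST step or later in the pipeline.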

Best,

Chengtao