bigbio / py-pgatk

Python tools for proteogenomics analysis toolkit
Apache License 2.0

add spectrumAI #70

Closed DongdongdongW closed 2 years ago

DongdongdongW commented 2 years ago

validate_peptides now supports two tasks: calculating the position of the variant amino acid on each variant peptide, and validating the variant peptides using SpectrumAI.

Get position: pypgatk validate_peptides --input_psm_table xxx --input_fasta xxx --output_psm_table xxx
'--input_psm_table' is the PSM table for which the position is to be obtained.
'--input_fasta' is the protein sequence file used for quantification.
'--output_psm_table' is the file name of the output.

SpectrumAI: pypgatk validate_peptides --mzml_path xxx --infile_name xxx --outfile_name xxx or pypgatk validate_peptides --mzml_files xxx --infile_name xxx --outfile_name xxx
'--mzml_path' is the path to the mzML files referenced in the PSM table.
'--mzml_files' is the list of mzML files referenced in the PSM table (the location of each file must be given; separate multiple files with ',').
'--infile_name' is the PSM table on which SpectrumAI is to be run. It must contain a 'position' column, which can be obtained with the previous command.
'--outfile_name' is the file name of the output.

husensofteng commented 2 years ago

Thanks for the great work. I agree with @ypriverol it would be better to have one command for both processes. To avoid re-calculating the variant position we can have a condition to skip the process if the position column exists in the input_psm_table file.

Also, regarding the mzml_path, maybe it is better to change to mzmls_base_path since input_psm_table usually contains PSMs from multiple mzML files and the file names are written in one of the columns.
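To illustrate the two suggestions above, here is a minimal sketch, not pypgatk code. It assumes the PSM table is tab-separated, that the position column is literally named `position` (as in the description above), and that the mzML file name sits in a column called `mzml_file`. That column name and the helpers `needs_position`, `resolve_mzml` and `plan` are hypothetical.

```python
import csv
import io
import os

POSITION_COLUMN = "position"   # column name mentioned in this thread
MZML_COLUMN = "mzml_file"      # hypothetical column holding the mzML file name


def needs_position(fieldnames):
    """True if the position step still has to run for this table."""
    return POSITION_COLUMN not in (fieldnames or [])


def resolve_mzml(base_path, row):
    """Join the shared base path with the per-PSM mzML file name."""
    return os.path.join(base_path, row[MZML_COLUMN])


def plan(psm_table_text, base_path):
    """Return (compute_position, sorted list of mzML paths) for a PSM table."""
    reader = csv.DictReader(io.StringIO(psm_table_text), delimiter="\t")
    rows = list(reader)
    compute_position = needs_position(reader.fieldnames)
    mzml_paths = sorted({resolve_mzml(base_path, row) for row in rows})
    return compute_position, mzml_paths


if __name__ == "__main__":
    table = (
        "peptide\tmzml_file\tposition\n"
        "AYIVKQR\trun1.mzML\t4\n"
        "PEPTIDE\trun2.mzML\t2\n"
    )
    print(plan(table, "/data"))
```

With a check like this, a single command could run the position step only when it is needed and then pass each PSM to SpectrumAI together with its resolved mzML file.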

DongdongdongW commented 2 years ago

> Thanks for the great work. I agree with @ypriverol it would be better to have one command for both processes. To avoid re-calculating the variant position we can have a condition to skip the process if the position column exists in the input_psm_table file.
>
> Also, regarding the mzml_path, maybe it is better to change to mzmls_base_path since input_psm_table usually contains PSMs from multiple mzML files and the file names are written in one of the columns.

Thank you for the confirmation. At present both are under one command, but they are still two separate processes. Do you mean we should merge them into one process? Also, mzml_path can already be the path to many mzML files. If necessary, I can rename mzml_path to mzmls_base_path.

ypriverol commented 2 years ago

No, @DongdongdongW, it is fine now with only one command. The only pending task is to support mzTab.

DongdongdongW commented 2 years ago

> No, @DongdongdongW, it is fine now with only one command. The only pending task is to support mzTab.

Got it.

husensofteng commented 2 years ago

Regarding replacing BLAST for identifying the variant position, @ypriverol and I discussed the following: we can avoid BLAST by implementing a function that identifies the proteins each peptide overlaps with:

  1. Non-canonical peptides: each peptide should be compared to the canonical protein sequences, and only those with exactly one mismatch need to be checked by SpectrumAI. There is no need to check peptides with more than one mismatch, since a peptide with two or more amino acid differences is already quite different from the canonical sequences.
  2. Mutated peptides: each peptide should be compared with all sequences, and those with a mismatch should be further checked by SpectrumAI.

DongdongdongW commented 2 years ago

So we would use our own method to compare the peptides with the protein sequences, @husensofteng?

husensofteng commented 2 years ago

Yes, if we can have an efficient implementation. ahocorasick is good for exact matches, though I am not sure about its usability for single mismatches.
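One common way to extend exact matching to a single mismatch is seeding. If a peptide aligns to a protein with at most one substitution, at least one of its two halves must match exactly, so exact hits of the halves can be verified by a direct comparison. The sketch below is only an illustration under these assumptions: substitutions only (no insertions or deletions), peptides of length 2 or more, and plain `str.find` standing in for an Aho-Corasick automaton. The function names are hypothetical and are not part of pypgatk.

```python
def mismatch_positions(a, b, limit):
    """Return the 0-based positions where equal-length strings a and b differ,
    or None as soon as more than `limit` differences are seen."""
    found = []
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            found.append(i)
            if len(found) > limit:
                return None
    return found


def find_with_one_mismatch(peptide, protein):
    """Return sorted (protein_start, mismatch_positions) pairs for every
    alignment of peptide to protein with at most one substitution."""
    n = len(peptide)
    half = n // 2
    # With at most one substitution, one of these two seeds matches exactly.
    seeds = [(0, peptide[:half]), (half, peptide[half:])]
    checked = set()
    hits = []
    for offset, seed in seeds:
        if not seed:
            continue
        pos = protein.find(seed)
        while pos != -1:
            start = pos - offset  # where the whole peptide would begin
            if 0 <= start <= len(protein) - n and start not in checked:
                checked.add(start)
                mm = mismatch_positions(peptide, protein[start:start + n], 1)
                if mm is not None:
                    hits.append((start, mm))
            pos = protein.find(seed, pos + 1)
    return sorted(hits)


if __name__ == "__main__":
    protein = "MKTAYIAKQRQISFVKSHFSRQ"
    print(find_with_one_mismatch("AYIAKQR", protein))  # exact match
    print(find_with_one_mismatch("AYIVKQR", protein))  # one substitution
    print(find_with_one_mismatch("GYIVKQR", protein))  # two substitutions
```

An empty mismatch list means the peptide is found exactly, a single entry gives the variant position on the peptide that would be passed to SpectrumAI, and peptides with no hit can be skipped. In a larger implementation the seed lookup is the part an Aho-Corasick automaton over all peptide halves could replace.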