CPTAC MS dataset files in s3 bucket are not found

pixuenan commented 2 weeks ago

Hi, thanks for creating this tool. Recently, when I use PepQuery2 in either the web application or the standalone version, I get an error saying that the MGF file in the S3 bucket does not exist. I have attached a screenshot of the error from the web application (Screenshot from 2024-09-20 16-24-07).

My input peptide is MAEASPHPGRYFCHCCSVEIVPRLPIISVQDASLVLSRSFRKRPEHRKWFCPLHSSHRPEPATVGHVDQHLFTLPQGYGQFAFGIFDDSFEIPTFPPGAQADDGRDPESRRERDHPSRHRYGARQPRARLTTRRATGRHEGVPTLEG

wenbostar commented 2 weeks ago

For a given peptide precursor (a combination of peptide sequence, charge, and modification), if no spectra match the query, it prints something like "*.mgf doesn't exist". This is not an error message from the search.

wenbostar commented 2 weeks ago

It’s quite common for some peptides not to have any matched spectra in a query.

pixuenan commented 1 week ago

Thanks for the reply

pixuenan commented 1 week ago

Is there a way to query multiple protein sequences in a single input file in the standalone version? It seems that only querying multiple peptides in a single input file is supported now.

wenbostar commented 1 week ago

Yes, you can put your protein sequences in a FASTA file like the one below and then set the parameters as "-i target_proteins.fasta -t protein -s 1". This only works for novel protein search, not known protein search.

target_proteins.fasta:

>sp|A0A087WT01|TVA27_HUMAN T cell receptor alpha variable 27 OS=Homo sapiens OX=9606 GN=TRAV27 PE=1 SV=1
MVLKFSVSLLWLQLAWVSTQLLEQSPQFLSLQEGENLTVYCNSSSVFSSLQWYRQEPGEG
PVLLVTVVTGGEVKKLKRLTFQFGDARKDSSLHLTAAQTGDTGLYLCAG
>sp|A0A1B0GTB2|TUNAR_HUMAN Protein TUNAR OS=Homo sapiens OX=9606 GN=TUNAR PE=1 SV=2
MVLTSENDEDRGGQEKESKEESVLAMLGLLGTLLNLLVLLFVYLYTTL
>sp|A0A1W2PP97|THSD8_HUMAN Thrombospondin type-1 domain-containing protein 8 OS=Homo sapiens OX=9606 GN=THSD8 PE=3 SV=2
MARTPGALLLAPLLLLQLATPALVYQDYQYLGQQGEGDSWEQLRLQHLKEVEDSLLGPWG
KWRCLCDLGKQERSREVVGTAPGPVFMDPEKLLQLRPCRQRDCPSCKPFDCDWRL
>sp|A0AUZ9|KAL1L_HUMAN KAT8 regulatory NSL complex subunit 1-like protein OS=Homo sapiens OX=9606 GN=KANSL1L PE=1 SV=2
MTPALREATAKGLSFSSLPSTMESDKMLYMESPRTVDEKLKGDTFSQMLGFPTPEPTLNT
NFVNLKHFGSPQSSKHYQTVFLMRSNSTLNKHNENYKQKKLGEPSCNKLKNLLYNGSNLQ
LSKLCLSHSEEFLKKEPLSDTTSQCMKDVQLLLDSNLTKDTNVDKVQLQNCKWYQENALL
DKVTDAELKKGLLHCTQKKLVPGHSNVPVSSSAAEKEEEVHARLLHCVSKQKLLLSQARR
TQKHLQMLLAKHVVKHYGQQMKLSMKHQLPKMKTFHEPTTLLGNSLPKCTELKPEVNTLT
AENKLWDDAKNGFARCTAAELQRFAFSATGLLSHVEEGLDSDATDSSSDDDLDEYTLRKN
VAVNCSTEWKWLVDRARVGSRWTWLQAQLSDLECKLQQLTDLHRQLRASKGLVVLEECQL
PKDLLKKQMQFADQAASLNLLGNPQVPQECQDPVPEQDFEMSPSSPTLLLRNLEKQSAQL
TELLNSLLAPLNLSPTSSPLSSKSCSHKCLANGLYRSASENLDELSSSSSWLLNQKHSKK
KRKDRTRLKSSSLTFMSTSARTRPLQSFHKRKLYRLSPTFYWTPQTLPSKETAFLNTTQM
PCLQSASTWSSYEHNSESYLLREHVSELDSSFHSVLSLPSDVPLHFHFETLLKKTELKGN
LAENKFVDEYLLSPSPVHSTLNQWRNGYSPLCKPQLRSESSAQLLQGRKKRHLSETALGE
RTKLEESDFQHTESGSHSNFTAVSNVNVLSRLQNSSRNTARRRLRSESSYDLDNLVLPMS
LVAPAKLEKLQYKELLTPSWRMVVLQPLDEYNLGKEELEDLSDEVFSLRHKKYEEREQAR
WSLWEQSKWHRRNSRAYSKNVEGQDLLLKEYPNNFSSSQQCAAASPPGLPSENQDLCAYG
LPSLNQSQETKSLWWERRAFPLKGEDMAALLCQDEKKDQVERSSTAFHGELFGTSVPENG
HHPKKQSDGMEEYKTFGLGLTNVKKNR
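
For anyone assembling such a file from sequences held in a script, here is a minimal Python sketch that formats (header, sequence) pairs as FASTA with 60-residue lines, matching the layout above. The function names `format_fasta` and `write_fasta` are hypothetical helpers, not part of PepQuery. Only the options "-i", "-t protein" and "-s 1" come from this thread; any other options a full run needs (for example the MS dataset and reference database) are not shown here and are deliberately left out.

```python
def format_fasta(records, width=60):
    """Return FASTA text for an iterable of (header, sequence) pairs.

    Hypothetical helper for illustration; not part of PepQuery.
    """
    lines = []
    for header, sequence in records:
        # Remove whitespace/newlines and normalise to upper case.
        seq = "".join(sequence.split()).upper()
        if not seq or not seq.isalpha():
            raise ValueError(f"invalid or empty sequence for {header!r}")
        lines.append(f">{header}")
        # Wrap the sequence into fixed-width lines.
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


def write_fasta(records, path, width=60):
    """Write the records to `path` as a FASTA file."""
    with open(path, "w") as handle:
        handle.write(format_fasta(records, width))


# Options quoted in this thread for a multi-protein (novel protein) query.
# Assumption: the remaining options of a complete command line are supplied
# separately; they are not described in this thread.
PEPQUERY_PROTEIN_OPTIONS = ["-i", "target_proteins.fasta", "-t", "protein", "-s", "1"]
```

Calling `write_fasta(records, "target_proteins.fasta")` with your own list of (header, sequence) pairs produces a file that can be passed via "-i" as described above.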

pixuenan commented 1 week ago

Thanks, that helps a lot. How can I tell whether a protein search result is confident? In the PepQuery output, psm_rank.txt is reported at the peptide level. Is any downstream analysis required for novel protein identification?

wenbostar commented 1 week ago

We have a description at http://pepquery.org/document.html#saoutput of how to interpret the results in the psm_rank.txt file, including how a match is judged confident in a query.

bzhanglab / PepQuery, issue #75