Tasks as described by @Husen in the README page.

This issue is the place to discuss the pending tasks:

[ ] Currently, the data in both parquet files (canonical and non-canonical) come from MSGF and COMET searches. We recently added SAGE, a new search engine, which would probably be worth including to gain more IDs. (@ypriverol task).
[ ] @DongdongdongW will try to include MSrescore in quantms to improve the PSM IDs.
[x] #4
- [x] #2
- [x] #3
[ ] Using the parquet files, we need to perform the following tasks:
- [ ] we need to annotate the parquet files to mark which peptides are canonical and which are non-canonical.
- [ ] we should compare the q-value and PEP distributions of non-canonical and canonical peptides.
- [ ] should we validate the non-canonical peptides that are single amino-acid variants using SpectrumAI?
- [ ] we can validate some of the non-canonical peptides using retention time.
[ ] Compare scores and FDR for known canonical and novel canonical peptides, check distributions.
[ ] Analyze the novel canonical peptides: locations, gene types, and other evidence for expression (@husensofteng)
[ ] Draft manuscript layout and sections (@ypriverol @husensofteng)
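
As a starting point for the annotation and score-comparison tasks above, here is a minimal sketch in plain Python. It assumes the PSMs have already been loaded from the parquet files into a list of dicts. The column names (`sequence`, `q_value`, `pep`), the helper names, and the toy values are all hypothetical and would need to be mapped to the real parquet schema. It only reports per-class medians, so a proper comparison of the distributions would still need histograms or a statistical test.

```python
from statistics import median


def annotate(psms, canonical_peptides):
    """Return copies of the PSM rows with an added 'peptide_class' label.

    'sequence' is an assumed column name, not taken from the real files.
    """
    return [
        {
            **psm,
            "peptide_class": (
                "canonical"
                if psm["sequence"] in canonical_peptides
                else "non-canonical"
            ),
        }
        for psm in psms
    ]


def summarize(annotated, score_columns=("q_value", "pep")):
    """Count PSMs and compute the median of each score column per class."""
    groups = {}
    for psm in annotated:
        groups.setdefault(psm["peptide_class"], []).append(psm)

    summary = {}
    for peptide_class, rows in groups.items():
        summary[peptide_class] = {"n": len(rows)}
        for col in score_columns:
            summary[peptide_class][f"median_{col}"] = median(r[col] for r in rows)
    return summary


# Toy data for illustration only; these are not real search results.
psms = [
    {"sequence": "PEPTIDEK", "q_value": 0.001, "pep": 0.002},
    {"sequence": "PEPTIDEK", "q_value": 0.003, "pep": 0.010},
    {"sequence": "ELVISLIVESK", "q_value": 0.004, "pep": 0.020},
    {"sequence": "NVELPEPR", "q_value": 0.008, "pep": 0.050},
    {"sequence": "VARIANTPEPK", "q_value": 0.006, "pep": 0.030},
]
canonical_peptides = {"PEPTIDEK", "ELVISLIVESK"}

annotated = annotate(psms, canonical_peptides)
summary = summarize(annotated)
for peptide_class, stats in summary.items():
    print(peptide_class, stats)
```

With the real data, the same grouping could be done directly on a data frame read from the parquet files. The point here is only the shape of the analysis: label each PSM first, then compare score summaries per label.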

@husensofteng we need to know whether the quant information is important. If not, I can start the search without the quant part, which will make it faster. We would also use all the PSMs rather than only those for which we can find a reliable feature.
