RefSeqMrnaId -> UniProt accession mapping

mariacuria commented 1 week ago

Is there software that does this? Search message boards.
Manually check 5-10 rows of the table. Does cBio have RefSeq protein IDs (NP_<...>)? No, so we need a mapping from NM (mRNA) to NP (protein).
Once you have the NP column, go to UniProt (or use their API) and add the column "UniProt accession".
Get the FASTA sequences for all NPs and all UniProt accessions, then run an all-against-all pairwise alignment (should take a couple of hours on the server, so run it overnight). Use BLAST, Clustal, T-Coffee, or whatever works.
From the pairwise alignment you will get the equivalent UniProt position. Add it to your table. E.g., if you have position 94 in the mRNA, in 80% of the cases it should be the same position in UniProt, but in 20% it could be a different one, such as position 120 in the UniProt canonical sequence.
Add a QC step when parsing the alignment file: require at least 95% (or a similar threshold) of the positions to be aligned.
Document everything.
Show the code to @seankim658.
Show the results during the Friday meeting.
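
A minimal sketch of the position-mapping and QC steps above, assuming the aligner output has already been parsed into two gapped strings of equal length (one for the NP sequence, one for the UniProt sequence). The function names, the example sequences, and the 0.95 default are illustrative only, not an agreed implementation.

```python
from typing import Dict


def map_positions(aligned_query: str, aligned_target: str) -> Dict[int, int]:
    """Map 1-based query positions to 1-based target positions.

    Both inputs are rows of one pairwise alignment, with '-' for gaps.
    Query residues aligned to a gap are left out of the result.
    """
    if len(aligned_query) != len(aligned_target):
        raise ValueError("aligned sequences must have equal length")
    mapping = {}
    q_pos = t_pos = 0
    for q_char, t_char in zip(aligned_query, aligned_target):
        if q_char != "-":
            q_pos += 1
        if t_char != "-":
            t_pos += 1
        if q_char != "-" and t_char != "-":
            mapping[q_pos] = t_pos
    return mapping


def aligned_fraction(aligned_query: str, aligned_target: str) -> float:
    """Fraction of query residues paired with a target residue (not a gap)."""
    query_len = sum(c != "-" for c in aligned_query)
    paired = sum(
        q != "-" and t != "-" for q, t in zip(aligned_query, aligned_target)
    )
    return paired / query_len if query_len else 0.0


def passes_qc(aligned_query: str, aligned_target: str,
              min_fraction: float = 0.95) -> bool:
    """QC gate: keep the alignment only if enough query positions are aligned."""
    return aligned_fraction(aligned_query, aligned_target) >= min_fraction


# Toy alignment (made-up sequences): one insertion in the target, one in the query.
np_row = "MK-TAYIA"
uniprot_row = "MKRTA-IA"
print(map_positions(np_row, uniprot_row))  # query position 3 maps to target position 4
print(passes_qc(np_row, uniprot_row))
```

Positions that fall opposite a gap have no UniProt equivalent and are simply absent from the mapping, so the table column should allow missing values.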

Before you start doing this, manually do this for EGFR. Get one position and do the entire workflow manually and show @rajamazumder.

Compute the frequency based on the number of patients and show this column next Friday.
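
One possible reading of the patient-based frequency, sketched under the assumption that it means "distinct patients carrying the mutation divided by all patients in the cohort". The record layout (patient ID, mutation key) and all names here are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def patient_frequency(
    records: Iterable[Tuple[str, Hashable]], total_patients: int
) -> Dict[Hashable, float]:
    """Return mutation_key -> fraction of patients carrying that mutation.

    Each record is (patient_id, mutation_key). A patient is counted once per
    mutation even if several samples from that patient report it.
    """
    if total_patients <= 0:
        raise ValueError("total_patients must be positive")
    patients_by_mutation = defaultdict(set)
    for patient_id, mutation_key in records:
        patients_by_mutation[mutation_key].add(patient_id)
    return {
        mutation: len(patients) / total_patients
        for mutation, patients in patients_by_mutation.items()
    }


# Made-up example: P1 appears twice for the same mutation and counts once.
rows = [("P1", "mutA"), ("P1", "mutA"), ("P2", "mutA"), ("P3", "mutB")]
print(patient_frequency(rows, total_patients=4))
```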

mariacuria commented 6 days ago

New workflow:

Starting from the hg19 genomic positions, map them to Ensembl hg38 at the genome level.
Grab the corresponding Ensembl transcript ID.
Map to Ensembl protein ID.
Map to UniProt accession.
Show @rykahsay on Mon in the internal meeting.
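
A rough sketch of how the first and last steps of this workflow could be done through the public Ensembl REST service. It assumes the `map/human/GRCh37/<region>/GRCh38` and `xrefs/id/<id>` endpoints and the response fields shown in the comments; check both against the current Ensembl REST documentation before relying on them. The helper names are hypothetical, and a local liftover tool could be swapped in for the first step.

```python
import json
import urllib.request
from typing import Any, Dict, List, Tuple

ENSEMBL_REST = "https://rest.ensembl.org"


def assembly_map_url(chrom: str, start: int, end: int) -> str:
    """URL asking Ensembl to convert a GRCh37 (hg19) region to GRCh38 (hg38)."""
    return f"{ENSEMBL_REST}/map/human/GRCh37/{chrom}:{start}..{end}/GRCh38"


def xrefs_url(ensembl_id: str) -> str:
    """URL listing external cross-references for an Ensembl ID (e.g. an ENSP)."""
    return f"{ENSEMBL_REST}/xrefs/id/{ensembl_id}"


def fetch_json(url: str) -> Any:
    """GET a URL and decode the JSON body (needs network access)."""
    request = urllib.request.Request(url, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def parse_mapped_regions(payload: Dict[str, Any]) -> List[Tuple[str, int, int, int]]:
    """Extract (chromosome, start, end, strand) from a map response.

    Assumes the shape {"mappings": [{"original": {...}, "mapped": {...}}]}.
    """
    regions = []
    for entry in payload.get("mappings", []):
        mapped = entry["mapped"]
        regions.append(
            (mapped["seq_region_name"], mapped["start"], mapped["end"], mapped["strand"])
        )
    return regions


def uniprot_accessions(xrefs: List[Dict[str, Any]],
                       dbname: str = "Uniprot/SWISSPROT") -> List[str]:
    """Keep only cross-references from the assumed UniProt database name."""
    return [x["primary_id"] for x in xrefs if x.get("dbname") == dbname]


# Intended usage (not executed here because it needs network access):
#   regions = parse_mapped_regions(fetch_json(assembly_map_url("7", start, end)))
#   accessions = uniprot_accessions(fetch_json(xrefs_url(ensp_id)))
```

An empty list from `parse_mapped_regions` would mean the position did not lift over, which is worth recording rather than silently dropping.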

mariacuria commented 11 hours ago

[x] Find a minimal tuple that uniquely characterizes each chromosomal position in order to trace GRCh38 positions back to the original JSON objects containing GRCh37 positions
[x] Extract chromosomal positions from JSON objects that are already in GRCh38
[ ] Find ENSP IDs
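
For reference, one way the "minimal tuple" search in the first task could be sketched: try field combinations of increasing size and return the first one whose values are unique across all records. The field names (`chr`, `pos`, `ref`, `alt`) and the records are made up for illustration; the real JSON objects may use different keys.

```python
from itertools import combinations
from typing import Any, Dict, List, Optional, Sequence, Tuple

Record = Dict[str, Any]


def minimal_unique_key(records: List[Record],
                       candidate_fields: Sequence[str]) -> Optional[Tuple[str, ...]]:
    """Return the smallest field combination that is unique per record.

    Combinations are tried in order of size, so the first hit is minimal.
    Returns None if even all candidate fields together are not unique.
    """
    for size in range(1, len(candidate_fields) + 1):
        for fields in combinations(candidate_fields, size):
            keys = [tuple(record.get(f) for f in fields) for record in records]
            if len(set(keys)) == len(keys):
                return fields
    return None


def index_by_key(records: List[Record],
                 fields: Sequence[str]) -> Dict[Tuple[Any, ...], Record]:
    """Build key -> original record, for tracing converted positions back."""
    return {tuple(record.get(f) for f in fields): record for record in records}


# Made-up records standing in for the GRCh37 JSON objects.
records = [
    {"chr": "7", "pos": 100, "ref": "A", "alt": "T"},
    {"chr": "7", "pos": 100, "ref": "A", "alt": "G"},
    {"chr": "12", "pos": 100, "ref": "A", "alt": "T"},
]
key_fields = minimal_unique_key(records, ["chr", "pos", "ref", "alt"])
lookup = index_by_key(records, key_fields)
print(key_fields)
```

A key that is minimal on a sample may stop being unique on the full dataset, so the search should be run on all records before the key is used for tracing.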

GW-HIVE / biomuta-old