Five Prime Sequences Compared to s288c Genome

I've been going over ModelUsage.ipynb. I’ve been trying to check if the five prime sequences match the sequences I am looking at on the s288c genome. For this I have been referencing the file "SpeciesLM/data/Sequences/Annotation/Assembled/five_prime.parquet". Sometimes the five prime sequences match with the s288c genome and sometimes they do not. I have shown an example output of the code that I run to view the sequence for particular genes.

# output is actually a pandas Series not str
# "SpeciesLM/data/Sequences/Annotation/Assembled/five_prime.parquet"
dataset_five_utr.loc[dataset[dataset['gene_id'] == "YBL072C"].index.tolist()]['five_prime_seq']
>>>"TTGTACCGTACACAGTGTTGCTTGCTAATTGCATGATCAGGTTGGGCCGACGCGGTTGCTTGAATGTGGCCCCTGTAAGAAACTTTATTGAAGAAGGTTGCGATGGCGTTACTGATTTATACGTGGGGATCTACGATGATCTTGCTAGCACTAATTTCACAGACAGGATAGCTGCGTGGGAGAATATTGTTGAGTGCACCTTTAGGACCAACAACGTAAAATTGGGTTACCTCATTGTAGATGAGTTTCACAACTTTGAAACGGAGGTCTACCGGCAGTCGCAATTTGGGGGCATAACTAACCTTGATTTTGACGCTTTTGAGAAAGCAATCTTTTTGAGCGGCACAGCCCCTGAGGCTGTTGCTGATGCTGCGTTGCAGCGTATTGGGCTTACGGGACTGGCCAAGAAGTCGATGGACATCAACGAGCTCAAACGGTCGGAAGATCTCAGCAGAGGTCTATCCAGCTATCCAACACGGATGTTTAATCTAATCAAGGAGAAATCCGAGGTGCCTTTAGGGCATGTTCATAAAATTTGGAAGAAAGTGGAATCACAGCCCGAAGAAGCACTGAAGCTTCTTTTAGCCCTCTTTGAAATTGAACCAGAGTCGAAGGCCATTGTAGTTGCAAGCACAACCAACGAAGTGGAAGAATTGGCCTGCTCTTGGAGAAAGTATTTTAGGGTGGTATGGATACACGGGAAGCTTGGGTGCTGCAGAAAAGGTGTCTCGCACAAAGGAGTTTGTCACTGACGGTAGCATGCAAGTTCTCATCGGAACGAAATTAGTGACTGAAGGAATTGACATTAAGCAATTGATGATGGTGATCATGCTTGATAATAGACTTAATATTATTGAGCTCATTCAAGGCGTAGGGAGACTAAGAGATGGGGGCCTCTGTTATCTATTATCTAGAAAAAACAGTTGGGCGGCAAGGAATCGTAAGGGTGAATTACCACCGATTAAGGAAGGCTGTATAACCGAACAGGTACGCGAGTTCTATG"

I have attached images from a word document that go over a few sequences showing where the processed ‘five_prime_seq’ contains the five prime sequence from the s288c genome (first example) and when it does not (last 3 examples). One the last example I provide the BLASTn result also. I've used images so I can do syntax highlighting a bit easier, but If needed I could share the raw sequence.

YAL037W

SpeciesLM-FASTA-S288c_Genome-YAL037W

YDL061C

SpeciesLM-FASTA-S288c_Genome-YDL061C

YBL092W

SpeciesLM-FASTA-S288c_Genome-YBL092W

src torchcell sequence genome scerevisiae s288c md YBL092W-SpeciesLM_five_prime_utr-CDS-BLASTn

Could you please help explain why this is? When I BLASTn your ‘five_prime_seq’ I find the sequence on the same chromosome but hundreds to thousands of bp away. I suspect maybe a different genome was used for this data processing. Some description of the data processing step would help me. Thanks for your consideration.

I am a bit confused by this. I am currently travelling, so I did not have time to look at every example you posted in detail. But for the YDL061C, I seem to get a different sequence. Specifically, I do:

fivepq = pd.read_parquet("../Sequences/Annotation/Assembled/five_prime.parquet") fivepq.query('gene_id == "YDL061C"')[seq_col].iloc[0]

The last few nucleotides of the sequence I get are: "AACTTATTGGTTATAGAATATATACAAAATG"

This seems to not correspond to any of the three sequences you posted for that gene. Specifically, it does not correspond to the sequence you cite as being from the SpeciesLM, which confuses me a bit. Maybe you can recheck this?

However, the sequence I get does correspond to what I find on the UCSC genome browser (SacCer3): Screenshot from 2023-09-24 22-33-30

Generally, all the sequences we have come from Ensembl, so we have whatever gene build is there (for S. cer this is R-64-1-1, which to my knowledge is the same as SacCer3). So possibly a difference in the gene build is indeed the issue.

I will check the other examples in the coming days, but maybe this already helps.

Thanks for the response. I've found that my query is incorrect. Hold off on this one, and I'll go through them again.

I have been using R64-4-1, but I think the issue was mostly my query. There will likely be the same. I am most interested in YIL111W since it is unclear to me how to process genes like it.

YAL037W

SpeciesLM-FASTA-S288c_Genome-YAL037W-corrected_query

Looks good with correct query.

YDL061C

SpeciesLM-FASTA-S288c_Genome-YDL061C-corrected_query

Looks good with correct query.

YIL111W

In my first post I accidentally put YAL037W twice, instead of YIL111W.

SpeciesLM-FASTA-S288c_Genome-YIL111W-corrected_query

What is the logic here? The gene has one intron and a single bp CDS. How was the position to start the UTR determined? Is it an issue that this CDS doesn't start with a start codon?

This sequence is an interesting edge case. Because it has an intron that seperates the start codon, ensembl annotates it as beginning with "TG" (see https://fungi.ensembl.org/Saccharomyces_cerevisiae/Transcript/Exons?db=core;g=YIL111W;r=IX:155311-155765;t=YIL111W_mRNA) and omits the 1bp exon. You would have to ask Ensembl why they chose to annotate this way. I have to admit that I did not really anticipate the case of the start codon being split at all, so I also did not implement any filtering for it.

Overall, this scenario is probably too rare for the model to effectively learn what is going on (in all of the data, this happens in 44k out of 14.8 million sequences, or about 0.3%). This being said, the predictions of the model do not change very much if I artifically mutate the TGT the sequence ends with to ATG. So the mere fact that the ATG is missing doesn't seem to collapse the predictions, but it is generally not predicted super well (e.g. the splice donor has much worse reconstruction than in other 5'UTR introns in S. cer - but this might also be because both the donor and the acceptor deviate from their canonical sequences, which iirc is rare in yeast).

gagneurlab / SpeciesLM