T-Wisse / MEP_Thomas

This repository serves as the documentation platform for my MEP in TU Delft.

Classification of essential genes based on SATAY data #25

Open T-Wisse opened 2 years ago

T-Wisse commented 2 years ago

Goal

I intend to classify all genes as essential or non-essential in different genetic backgrounds, based on SATAY data. These backgrounds should at least include dbem3 and dbem1dbem3, but may include more. From this information we can determine which genes, and how many, are essential only in specific genetic backgrounds. This may show how genetic interactions change with the genetic background. The relation between the identified genes of interest and the changes in the genetic background can then be investigated further.

Previous work

Previous work from Wessel resulted in MATLAB scripts that could be used to classify genes as either essential or non-essential. These scripts can be found on the TU Delft drive (staff-groups\tnw\bn\ll\Shared\Theses\Supplementary files_Master_thesis_Wessel_Teunisse\Supplementary files). The resulting classifier was based on 9 features and was 92.7% correct (shown below).

[image: Wessel's confusion matrix]

However, it required the transposon mapping to occur within MATLAB. To make the script more relevant and easier to use, it needed to use the output of the transposonmapper instead.

Classifier

The S2 and S3 scripts have been rewritten in full to be compatible with this output and can be found as classifierWesselAdapted.m. Accompanying functions are included in the directory classifierFunctions. Furthermore, some bugs were fixed and I am in the process of adding better documentation. The script now also uses the same annotation (gff) file as the transposonmapper.

Using this script, a classifier has been trained on the wild-type data acquired by Leila, using the same features the original script used. The trained classifier (trainedModelBaggedTree_220125.mat) can be found HERE. The resulting confusion matrix is shown below.

[image: ConfusionMatrix220125]

This corresponds to about 85% correct. Most of the difference between this and Wessel's confusion matrix comes from the truly essential genes, which are less often correctly classified as essential.
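As a rough illustration of how the overall percentage and the per-class rates follow from a 2x2 confusion matrix, here is a minimal Python sketch. The counts are invented placeholders, not the values from either confusion matrix above, and the function name `confusion_summary` is hypothetical.

```python
def confusion_summary(tp, fn, fp, tn):
    """Summarise a 2x2 confusion matrix with 'essential' as the positive class.

    tp: essential genes classified as essential
    fn: essential genes classified as non-essential
    fp: non-essential genes classified as essential
    tn: non-essential genes classified as non-essential
    """
    total = tp + fn + fp + tn
    return {
        # Fraction of all genes that were classified correctly.
        "accuracy": (tp + tn) / total,
        # Fraction of truly essential genes that were recognised as essential.
        "essential_recall": tp / (tp + fn),
        # Fraction of truly non-essential genes recognised as non-essential.
        "nonessential_recall": tn / (tn + fp),
    }


# Placeholder counts for illustration only (not taken from the real data).
summary = confusion_summary(tp=60, fn=40, fp=10, tn=90)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Reporting the essential-class rate separately from the overall accuracy makes it easier to see whether a drop in accuracy is concentrated in the essential genes, as described above.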

Results

The trained classifier was used to classify genes with SATAY data from yTW001_4 and yLIC137_7. The resulting workspace (classification_260125.mat) is included here for convenience. Gene CCT8 was identified as essential in dbem3, but not in dbem1dbem3. Genes EFB1, RTG3, PRP28, SRP101, ALG13, YJL195C and PRP19 were identified as essential in dbem1dbem3, but not in dbem3. The distributions of the classification scores can be found below. To the eye they appear slightly different, which could be interesting.

[image: classificationScores_yLIC137_7]

[image: classificationScores_yTW001_4]

The classification score is returned by the classifier and is a measure of how well a gene classifies as essential. The lower the value, the more the features of the gene correspond to those of an essential gene. To keep the gene list manageable and the confidence high, a value below 0.05 is used to delineate genes that are definitely essential in the corresponding genetic background. However, the classifier itself labels genes as essential for a score of around 0.3 or lower (depending on the trained classifier). The 0.05 cut-off produced the list above of genes that are essential in only one genetic background.
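The comparison between backgrounds described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the scores are assumed to be available as a gene-to-score mapping per background, lower scores are taken to mean "more essential" as stated above, and the gene names, score values and function names are all hypothetical placeholders rather than anything from the actual workspace.

```python
STRICT_THRESHOLD = 0.05  # cut-off quoted above for "definitely essential"


def essential_genes(scores, threshold=STRICT_THRESHOLD):
    """Return the set of genes whose classification score is below threshold."""
    return {gene for gene, score in scores.items() if score < threshold}


def background_specific(scores_a, scores_b, threshold=STRICT_THRESHOLD):
    """Return (only_in_a, only_in_b): genes essential in exactly one background.

    Only genes that have a score in both backgrounds are compared.
    """
    shared = scores_a.keys() & scores_b.keys()
    essential_a = essential_genes(scores_a, threshold) & shared
    essential_b = essential_genes(scores_b, threshold) & shared
    return essential_a - essential_b, essential_b - essential_a


# Made-up scores for placeholder genes (not real SATAY results).
scores_dbem3 = {"geneA": 0.01, "geneB": 0.40, "geneC": 0.02, "geneD": 0.90}
scores_dbem1dbem3 = {"geneA": 0.60, "geneB": 0.03, "geneC": 0.01, "geneD": 0.85}

only_dbem3, only_dbem1dbem3 = background_specific(scores_dbem3, scores_dbem1dbem3)
print("essential only in dbem3:", sorted(only_dbem3))
print("essential only in dbem1dbem3:", sorted(only_dbem1dbem3))
```

Note that with a single strict cut-off, a gene scoring just above 0.05 in the other background still counts as "not essential" there, even though the classifier itself might label it essential. A stricter variant could additionally require the score in the other background to exceed the classifier's own cut-off.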

The nature of these genes is yet to be investigated. However, the fact that more genes are essential in the dbem1dbem3 background than in the dbem3 background suggests that more genes become essential as genes are removed and the fitness of the cell is reduced.

Issues

Some issues, unexplained observations and tasks remain.

leilaicruz commented 2 years ago

Great @T-Wisse! Very nice. I will outline some questions here:

I think as next steps you should:

sanity checks:

T-Wisse commented 2 years ago

Thanks for the suggestions and thoughts. I will also update the post above to include some of these points. In short:

leilaicruz commented 2 years ago

Thanks for the suggestions and thoughts. I will also update the post above to include some of these points. In short:

  • Yes, I have added a table with the scores for yLIC137_7 and yTW001_4. A higher score corresponds to a higher probability of being essential.

OK

  • Will edit the post to do so

OK

  • Yes, the accuracy I obtain here is lower than what Wessel got. This is especially noticeable in the lower percentage of essential genes that are correctly predicted by my classifier. Wessel used different data than I have, which I am now looking for. I will run the classifier on that data, too. However, I don't really expect that to explain the entire difference. I will make some slight changes that I think could help, including how I use the WT data I have to train the classifier.

Yes, you should look into which factors make the accuracy go down in this type of model.

Interestingly, I have previously obtained a better result using fewer features. I will have to see what changed. I might have introduced a bug somewhere.

Did you compute the features yourself, or did you use his implementation of them?

  • I have added a table (.mat and .xlsx) which shows what I believe you suggest. The rows are the genes that are essential in at least 1 background. The columns show their predicted essentiality in WT, yLIC137 and yTW001_4. It also includes the annotated essential genes in WT.

Could you make a heatmap with some interesting genes?

  • I still have to check the differences between replicates.

OK

  • Yes, I just need to find his data. In addition, I will run his classifier on my data for comparison.

OK