Classification of essential genes based on SATAY data

Goal

I intend to classify all genes as essential or non-essential in different genetic backgrounds, based on SATAY data. These backgrounds should include at least dbem3 and dbem1dbem3, and may include more. From this classification we can determine which genes, and how many, are essential in some genetic backgrounds but not in others. This may show how genetic interactions change with the genetic background. The relation between the identified genes of interest and the changes in the genetic background can then be investigated further.

Previous work

Previous work from Wessel resulted in MATLAB scripts that could be used to classify genes as either essential or non-essential. These scripts can be found on the TU Delft drive (staff-groups\tnw\bn\ll\Shared\Theses\Supplementary files_Master_thesis_Wessel_Teunisse\Supplementary files). The resulting classifier was based on 9 features and was 92.7% correct (shown below).

However, it required the transposon mapping to occur within MATLAB. To make the script more relevant and easier to use, it needed to use the output of the transposonmapper instead.

Classifier

The S2 and S3 scripts have been rewritten in full to be compatible with this output and can be found as classifierWesselAdapted.m. Accompanying functions are included in the directory classifierFunctions. Furthermore, some bugs were fixed and I am in the process of adding better documentation. The script now also uses the same annotation (gff) file as the transposonmapper.

Using this script, a classifier has been trained on the wild-type data acquired by Leila, using the same features the original script used. The trained classifier (trainedModelBaggedTree_220125.mat) can be found HERE. The resulting confusion matrix is shown below.

[Figure: ConfusionMatrix220125]

This corresponds to about 85% correct. Most of the difference between this and Wessel's confusion matrix comes from the truly essential genes, which are less often correctly classified as essential.

Results

The trained classifier was used to classify genes with SATAY data from yTW001_4 and yLIC137_7. The resulting workspace (classification_260125.mat) is included here for convenience. Gene CCT8 was identified as essential in dbem3, but not in dbem1dbem3. Genes EFB1, RTG3, PRP28, SRP101, ALG13, YJL195C and PRP19 were identified as essential in dbem1dbem3, but not in dbem3. The distribution of the classification scores can be found below. By eye, the two distributions appear slightly different, which could be interesting.

[Figures: classificationScores_yLIC137_7, classificationScores_yTW001_4]

The classification score is returned by the classifier and measures how well a gene classifies as essential. The lower the value, the more closely the features of the gene correspond to those of an essential gene. To keep the gene list manageable and the confidence high, a score below 0.05 is used to mark genes as definitely essential in the corresponding genetic background. The classifier itself, however, labels genes as essential at a score of around 0.3 or lower (depending on the trained classifier). The 0.05 threshold produced the list above of genes that are essential in only one genetic background.
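The thresholding step described above could be sketched roughly as follows. This is only an illustration in Python, not the MATLAB code used here: the function names, the dictionary layout and the gene names and scores are all made up, and it assumes that a lower score means "more essential", as stated above.

```python
# Hypothetical sketch: gene names and scores below are invented for illustration.
STRICT_THRESHOLD = 0.05  # stricter cutoff than the classifier's own ~0.3


def essential_genes(scores, threshold=STRICT_THRESHOLD):
    """Return the set of genes whose score lies below the threshold."""
    return {gene for gene, score in scores.items() if score < threshold}


def essential_only_in(scores_a, scores_b, threshold=STRICT_THRESHOLD):
    """Genes below the threshold in background A but not in background B.

    A gene that is missing from scores_b counts as "not essential in B".
    """
    return essential_genes(scores_a, threshold) - essential_genes(scores_b, threshold)


# Two made-up backgrounds with three made-up genes each.
scores_bg1 = {"geneA": 0.01, "geneB": 0.20, "geneC": 0.04}
scores_bg2 = {"geneA": 0.02, "geneB": 0.03, "geneC": 0.40}

only_bg1 = sorted(essential_only_in(scores_bg1, scores_bg2))
only_bg2 = sorted(essential_only_in(scores_bg2, scores_bg1))
print("essential only in background 1:", only_bg1)
print("essential only in background 2:", only_bg2)
```

Note that a gene scoring just above 0.05 in one background and just below it in the other would also end up in these lists, so borderline genes are worth checking against the raw SATAY data.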

The nature of these genes is yet to be investigated. However, the fact that more genes are essential in the dbem1dbem3 background than in the dbem3 background suggests that more genes become essential when genes are removed and the fitness of the cell is reduced.

Issues

Some issues, unexplained observations and tasks remain.

Compare the results of the updated classifier to the previous classifier by Wessel. With equal data it should reach about equal accuracy. So far it does not.
Double check whether the resulting essential genes indeed appear essential based on the SATAY data.
The read count per gene from one data file does not correspond with the read count from another file. It is unclear why for now. Maybe I mixed up the files or misunderstood the exact meaning of the information in the tables.
Data from separate SATAY experiments with equal backgrounds has to be added together as well as compared.
The resulting classification scores of both yTW001 and yLIC137 are strangely discrete. It is unclear to me why.
Adding more features may improve the predictive power. However, we may not need this.
Currently we divide by 0 in a few cases when determining the feature NI20kb. This feature compares the transposon density in the local region with that within the gene. A possible solution would be to add 1 to the relevant array, so MATLAB does not return infinite values. This may help predictions.
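The pseudocount idea from the last point could look roughly like the sketch below. It is written in Python for illustration only and is not the MATLAB implementation. The exact definition of NI20kb is not given in this post, so the function name, its arguments and the choice of the surrounding region as the denominator are assumptions.

```python
# Hypothetical sketch: the real NI20kb feature may be defined differently.
def density_ratio(gene_insertions, gene_length,
                  region_insertions, region_length, pseudocount=1):
    """Ratio of transposon density inside the gene to that in the local region.

    The pseudocount is added to the region's insertion count, so the
    denominator is never zero and the result is always finite.
    """
    if gene_length <= 0 or region_length <= 0:
        raise ValueError("lengths must be positive")
    gene_density = gene_insertions / gene_length
    region_density = (region_insertions + pseudocount) / region_length
    return gene_density / region_density


# A region without any insertions no longer gives an infinite value.
ratio_empty_region = density_ratio(5, 1000, 0, 20000)
ratio_empty_both = density_ratio(0, 1000, 0, 20000)
print(ratio_empty_region, ratio_empty_both)
```

A pseudocount of 1 slightly lowers the ratio for every gene, with the largest effect in regions with few insertions, so it may be worth checking how sensitive the classifier is to this choice.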

Great @T-Wisse! Very nice. I will outline some questions here:

Can you put the genes scored as essential in each background in a table, together with the score (or probability) of each of them being essential?
Can you explain the figures you present in the results part in more detail?
If I understand correctly, your accuracy is different from what Wessel got? If so, is it higher or lower?

I think as next steps you should:

Make a matrix where the rows are the unique set of all essential genes in WT and the rest of the backgrounds, and each column is a different background you analyze (including the WT). Fill in a 1 or a 0 depending on whether that gene is essential in that background. That is, in A{i,j}, i runs over all unique essential genes across the backgrounds and j over the backgrounds; the entry is 1 if gene i is essential in background j, and 0 if it is not.

sanity checks:

The essential gene predictions from biological replicates should overlap by more than 95%.
Running the classifier on the data Wessel used for training should give similar or identical output.
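The matrix and the replicate check suggested above could be sketched as follows. This is an illustrative Python sketch, not existing code from the repository: the function names, background labels and gene names are hypothetical, and "overlap" is assumed here to mean the size of the intersection divided by the size of the union, which is only one possible definition.

```python
# Hypothetical sketch: gene names and background labels are invented.
def essentiality_matrix(essential_by_background):
    """Build a genes x backgrounds matrix of 1 (essential) and 0 (not essential).

    essential_by_background maps a background name to the set of genes
    predicted essential in it. Rows are all genes that are essential in at
    least one background; columns follow the order of the input mapping.
    """
    backgrounds = list(essential_by_background)
    genes = sorted(set().union(*essential_by_background.values()))
    matrix = [
        [1 if gene in essential_by_background[bg] else 0 for bg in backgrounds]
        for gene in genes
    ]
    return genes, backgrounds, matrix


def overlap_fraction(genes_a, genes_b):
    """Shared genes divided by all genes found in either set (assumed definition)."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    if not union:
        return 1.0  # two empty predictions agree trivially
    return len(a & b) / len(union)


essential_by_background = {
    "WT": {"geneA", "geneB"},
    "bg1": {"geneA", "geneC"},
    "bg2": {"geneC"},
}
genes, backgrounds, matrix = essentiality_matrix(essential_by_background)
print(backgrounds)
for gene, row in zip(genes, matrix):
    print(gene, row)

# Sanity check on two made-up replicates that share 98 of 100 genes.
replicate_1 = {f"gene{i}" for i in range(100)}
replicate_2 = {f"gene{i}" for i in range(2, 100)}
overlap = overlap_fraction(replicate_1, replicate_2)
replicates_agree = overlap > 0.95
print(overlap, replicates_agree)
```

If the overlap should instead be measured relative to one replicate (for example the smaller set), only the denominator in `overlap_fraction` needs to change.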

Thanks for the suggestions and thoughts. I will also update the post above to include some of these points. In short:

Yes, I have added a table with the scores for yLIC137_7 and yTW001_4. A higher score corresponds to a higher probability to be essential.
I will edit the post to do so.
Yes, the accuracy I obtain here is lower than what Wessel got. This is especially noticeable in the lower percentage of essential genes that are correctly predicted by my classifier. Wessel used different data than I have, which I am now looking for. I will run the classifier on that data, too. However, I don't really expect that to explain the entire difference. I will make some slight changes that I think could help, including how I use the WT data I have to train the classifier. Interestingly, I have previously obtained a better result using fewer features. I will have to see what changed. I might have introduced a bug somewhere.
I have added a table (.mat and .xlsx) which shows what I believe you suggest. The rows are the genes that are essential in at least 1 background. The columns show their predicted essentiality in WT, yLIC137 and yTW001_4. It also includes the annotated essential genes in WT.
I still have to check the differences between replicates.
Yes, I just need to find his data. On the other hand, I will also run his classifier on my data for comparison.

Yes, the accuracy I obtain here is lower than what Wessel got. This is especially noticeable in the lower percentage of essential genes that are correctly predicted by my classifier. Wessel used different data than I have, which I am now looking for. I will run the classifier on that data, too. However, I don't really expect that to explain the entire difference. I will make some slight changes that I think could help, including how I use the WT data I have to train the classifier.

Yes, you should look into which factors make the accuracy go down in this type of model.

Interestingly, I have previously obtained a better result using fewer features. I will have to see what changed. I might have introduced a bug somewhere.

Did you compute the features yourself, or did you use his implementation of them?

I have added a table (.mat and .xlsx) which shows what I believe you suggest. The rows are the genes that are essential in at least 1 background. The columns show their predicted essentiality in WT, yLIC137 and yTW001_4. It also includes the annotated essential genes in WT.

Could you make a heatmap with some interesting genes?

T-Wisse / MEP_Thomas