known and measured essential genes in depth characterization

leilaicruz commented 4 years ago

[ ] From the code, I could not work out the following: what proportion of genes is shared between genes_boolean_list and known_essential_genes_list (the common genes, and the genes that belong to only one of the two subsets), and how does this correlate with the transposon counts?

Gregory94 commented 4 years ago

Basically, we have two lists of genes to deal with. One is a list of all genes that were analyzed, ideally every gene in the genome (stored in the variable gene_name_list). The other is a list of all genes that we know are essential (stored in the variable known_essential_genes_list). For further processing and plotting, I wanted to know which of the genes in the gene_names_list variable also occur in the known_essential_genes_list (i.e. which of the analyzed genes are essential). The genes from the gene_names_list that occur in the known_essential_genes_list are marked True (they are essential) and the genes that are not present are marked False (they are not essential). This information is stored in the variable genes_boolean_list. Ideally, the genes marked True have few transposon insertions. To check this, I made the violin plot, which separates the data based on the True/False markings. You would expect the average number of transposon insertions to be lower for the genes marked True than for the genes marked False.
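
As a minimal sketch of the marking step described above (not the actual analysis code): the gene names and counts below are made up for illustration, and `transposon_counts` is an assumed name for a list of insertion counts in the same order as `gene_name_list`.

```python
from statistics import mean

# Hypothetical example data; the real lists come from the analysis.
gene_name_list = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
known_essential_genes_list = ["GENE_A", "GENE_C", "GENE_X"]
transposon_counts = [2, 40, 5, 31]  # assumed: one count per gene, same order

# A set makes the membership test fast for a genome-sized list.
known_essential = set(known_essential_genes_list)

# True if the analyzed gene is annotated as essential, False otherwise.
genes_boolean_list = [gene in known_essential for gene in gene_name_list]

# Split the counts by the True/False marking, as the violin plot does.
counts_true = [c for c, flag in zip(transposon_counts, genes_boolean_list) if flag]
counts_false = [c for c, flag in zip(transposon_counts, genes_boolean_list) if not flag]

mean_true = mean(counts_true)
mean_false = mean(counts_false)
print("mean insertions, marked True: ", mean_true)
print("mean insertions, marked False:", mean_false)
```

Note that a known essential gene that was never analyzed (`GENE_X` here) does not show up in `genes_boolean_list` at all, since that list only has one entry per analyzed gene.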

leilaicruz commented 4 years ago

Yes, clear. So what is interesting is to know the identity of the genes in three groups: those that are both measured and previously annotated as essential (your True list), those that are measured as essential but were not annotated before, and those that are annotated as essential but were not measured as such. This is to keep track of the identity of these (I imagine) few cases where there is a mismatch.
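
The three groups can be sketched with plain set operations. This assumes that some criterion for "measured as essential" already exists; `measured_essential` is a hypothetical set used only for illustration, and the gene names are made up.

```python
# Hypothetical example sets; gene names are placeholders.
known_essential = {"GENE_A", "GENE_C", "GENE_X"}      # annotated as essential
measured_essential = {"GENE_A", "GENE_B", "GENE_C"}   # assumed output of some criterion

# Both measured and previously annotated as essential.
both = known_essential & measured_essential

# Measured as essential, but not annotated before.
measured_only = measured_essential - known_essential

# Annotated as essential, but not measured as such.
annotated_only = known_essential - measured_essential

for label, genes in [("both", both),
                     ("measured only", measured_only),
                     ("annotated only", annotated_only)]:
    print(f"{label}: {sorted(genes)}")
```

Keeping the gene names rather than only the group sizes makes it possible to look up each mismatch individually.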

Gregory94 commented 4 years ago

But then we run into the question of how to define a gene as essential based on our results. Should we simply set a threshold, so that a gene with fewer than x transposons and/or reads is deemed essential? Alternatively, I was thinking of using Bayes' theorem, a mathematical way of calculating the probability that a hypothesis is true given some evidence. In our case this would mean calculating the probability that a gene is essential given the number of transposon insertions it has.
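
To make the Bayes idea concrete, here is a rough sketch under strong assumptions that are not part of the analysis: insertion counts are modelled as Poisson distributed, with one mean for essential genes and one for non-essential genes, and the prior fraction of essential genes is given. The function names and all numbers are hypothetical.

```python
from math import exp, lgamma, log

def poisson_log_pmf(k, mean):
    """Log-probability of observing k events under a Poisson with the given mean."""
    return -mean + k * log(mean) - lgamma(k + 1)

def posterior_essential(n_insertions, prior_essential, mean_essential, mean_nonessential):
    """P(essential | n_insertions) from Bayes' theorem with two Poisson likelihoods.

    ASSUMPTION: the Poisson model and its two means are placeholders,
    not something estimated from the SATAY data.
    """
    log_e = log(prior_essential) + poisson_log_pmf(n_insertions, mean_essential)
    log_n = log(1 - prior_essential) + poisson_log_pmf(n_insertions, mean_nonessential)
    # Subtract the larger term before exponentiating to avoid overflow.
    m = max(log_e, log_n)
    weight_e = exp(log_e - m)
    weight_n = exp(log_n - m)
    return weight_e / (weight_e + weight_n)

# Hypothetical numbers: 20% of genes essential, 3 vs 30 insertions on average.
for n in (0, 5, 10, 30):
    p = posterior_essential(n, prior_essential=0.2, mean_essential=3, mean_nonessential=30)
    print(f"{n:2d} insertions -> P(essential) = {p:.3g}")
```

The result depends entirely on the chosen prior and likelihoods, so these would have to be justified from the data before the probabilities mean anything.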

leilaicruz commented 4 years ago

You could set a threshold below which at least 90% of the genes already known to be essential fall, and use this as a working criterion for essentiality. To apply Bayes' theorem, you would also have to propose the probability of observing x transposon counts for an essential gene (the likelihood of the data) and your a priori distribution on gene essentiality, right? You can try, and we can discuss it later, but I think our a priori knowledge of when a gene is essential or not is very limited, so I am not sure how much power Bayes would have here... I was looking at this website, but if you have another view, please let me know :)
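
A small sketch of the 90% threshold idea, assuming the transposon counts are available as a dictionary from gene name to count (`counts_per_gene` is a hypothetical name, and the data are made up). It picks the smallest count such that at least 90% of the known essential genes lie at or below it.

```python
import math

def coverage_threshold(essential_counts, coverage=0.9):
    """Smallest count t such that at least `coverage` of the values are <= t."""
    if not essential_counts:
        raise ValueError("need at least one count")
    ordered = sorted(essential_counts)
    # Number of genes that must fall at or below the threshold.
    k = max(1, math.ceil(coverage * len(ordered)))
    return ordered[k - 1]

# Hypothetical example data: insertion counts per gene.
counts_per_gene = {
    "E1": 0, "E2": 1, "E3": 1, "E4": 2, "E5": 3,
    "E6": 3, "E7": 4, "E8": 5, "E9": 8, "E10": 60,
    "N1": 6, "N2": 35, "N3": 50,
}
known_essential = {f"E{i}" for i in range(1, 11)}

threshold = coverage_threshold(
    [counts_per_gene[g] for g in known_essential], coverage=0.9
)

# Working criterion: a gene is "measured essential" if its count is <= threshold.
measured_essential = {g for g, c in counts_per_gene.items() if c <= threshold}
print("threshold:", threshold)
print("measured essential:", sorted(measured_essential))
```

In this toy data the criterion misses one annotated gene with an unusually high count and picks up one unannotated gene with a low count, which are exactly the kinds of mismatches worth listing by name.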

leilaicruz commented 4 years ago

Hi Greg, I made this plot from your data on the genes marked True (essential) to see how the transposon counts vary and where we could set a decent threshold:

[Figure: possible threshold for essentiality based on transposon counts]

SATAY-LL / LaanLab-SATAY-DataAnalysis
