QData / DeepChrome

Bioinformatics16: DeepChrome: Deep-learning for predicting gene expression from histone modifications
http://deepchrome.net
Apache License 2.0
62 stars 14 forks source link

Reference genes and their expression ... ? #2

Closed Noired closed 5 years ago

Noired commented 6 years ago

Dear Ritambhara, by reading your paper, I cannot get which criterion did you use to get the 19802 gene-samples constituting your dataset before train-test-val split. There is no reference to any annotation file and the procedure by which you associated those gene with the REMC expression quantifications is even darker to me.

In the other issue reported here you suggested that gene TSSs were retrieved by the table that can be constructed at this address. Could you please be more precise? the table got by setting:

clade: Mammal genome: Human assembly: Feb 2009 (GRCh37/hg19) group: Genes and Gene Predictions track: UCSC Genes (???) table: knownGene region: genome

has 82,960 rows, clearly much more than the genes you investigated.

As far as the gene expression quantifications are concerned, REMC says here that those were built by considering gencode v10 annotation file. The related files can be found here, but you were not clear about which of those files was used. It's unlikely that you used the file "57epigenomes.RPKM.pc.gz" because it contains less genes than those you considered (19795 vs 19802). Moreover, supposing you used the table retrieved as described above to select reference genes and their TSS, how did you translate the ucsc IDs into ensembl IDs in order to consider the right expression line for each gene?

Thank you for the support, I hope you will kindly help me in replicating your experiments.

Cheers, noired

rs3zz commented 6 years ago

Hello noired,

I will be happy to assist you with the data generation.

  1. When generating the table use "track: Ensembl Genes", that will give you the ensembl IDs (column "name2") that can be matched to those obtained in the REMC file.
  2. The table you get using the table browser might still have >19795 entries as there will be multiple entries of the same genes, as well as some other entries that are not of our use. The idea is to filter out all the extra information and use only the genes for which RPKM is reported for REMC data. We did this by matching the ensembl IDs (using join command in bash).
  3. We used the file "57epigenomes.RPKM.pc.gz" and 2-3 genes might have mapped twice during the matching process resulting in 19802 genes.

Please let me know if you have any more questions.

Regards Ritambhara

Noired commented 6 years ago

Thank you Ritambhara for your support. I am now going to try. In the meanwhile, one more question. Which field did you consider for the TSS of the selected genes? Is it the 'txStart' for "sense" strand and 'txStop' for "anti-sense" genes?

Best regards, noired

Noired commented 6 years ago

Dear Ritambhara, I am sorry, but I am still encountering difficulties in retrieving the reference genes.

I downloaded the table according to your suggestion by setting:

clade: Mammal genome: Human assembly: Feb 2009 (GRCh37/hg19) group: Genes and Gene Predictions track: Ensemble Genes table: ensGene region: genome

and the "57epigenomes.RPKM.pc.gz" file from REMC.

The problem I am specifically facing is that not all the 19795 genes from REMC are present in the table downloaded from the genome.ucsc website. Indeed by running a short python script, I realized:

Moreover, what does

and 2-3 genes might have mapped twice during the matching process

exactly mean?

As far as I can see from the output of the bash join command, all the genes matches multiple times, because, clearly, each of them has several associated lines in the table downloaded from genome.ucsc, as you also pointed out.

In the following, a picture showing the running of the script I mentioned before.

schermata 2018-01-02 alle 18 43 18

Do you have the list of the gene names (even ensembl) that you finally considered in your experiments?

Still hoping to get your support in retrieving the names of the reference genes and their TSS.

Cheers, noired

rs3zz commented 6 years ago

Hello noired,

Looking back into my notes, we also used the "gene info" file from here.

To make things easier, I am sharing the final gene file we used (with locations TSS+/-5000bp positions). GeneFile.txt

I will add it to the github repo as well. Please let me know if you still have questions.

Regards Ritambhara

Noired commented 6 years ago

Dear Rithambara.

Please, do not add the GeneFile.txt file to the github repo. The locations are not computed as TSS+/-5000bp, as their differences equal 20k, not 10k as expected. As an example, take the first line:

chrX 99884988 99904988 ENSG00000000003_TSPAN6

it is easy to see how the difference stop-start is 20k: 99904988 - 99884988 = 20000.

Are you really sure this is the final gene file that was used?

Am I missing some important fact...?

noired

rs3zz commented 6 years ago

Hello noired,

That is not an issue at all, this file is TSS+/-10,000 bp instead of 5kb, I apologize for the confusion. I will update the file, meanwhile, you can add/subtract 5kbp to obtain the original file. Everything else is correct.

Regards Ritambhara

Noired commented 6 years ago

Dear Rithambara, I would like to finally ask you one more question. How did you perform the fair train-val-test split? Which third of the dataset was specifically assigned to training, which third to validation and which third to testing? Did you proceed randomly? Did you use a particular seed?

noired

rs3zz commented 6 years ago

Hello noired,

The split was straightforward using bash split command with lines 6601,6601, and 6600 respectively. Top 6601 were training, middle 6601 were validation and last 6600 were test samples.

Regards Ritambhara