Closed Noired closed 5 years ago
Hello noired,
I will be happy to assist you with the data generation.
Please let me know if you have any more questions.
Regards Ritambhara
Thank you Ritambhara for your support. I am now going to try. In the meanwhile, one more question. Which field did you consider for the TSS of the selected genes? Is it the 'txStart' for "sense" strand and 'txStop' for "anti-sense" genes?
Best regards, noired
Dear Ritambhara, I am sorry, but I am still encountering difficulties in retrieving the reference genes.
I downloaded the table according to your suggestion by setting:
clade: Mammal genome: Human assembly: Feb 2009 (GRCh37/hg19) group: Genes and Gene Predictions track: Ensemble Genes table: ensGene region: genome
and the "57epigenomes.RPKM.pc.gz" file from REMC.
The problem I am specifically facing is that not all the 19795 genes from REMC are present in the table downloaded from the genome.ucsc website. Indeed by running a short python script, I realized:
Moreover, what does
and 2-3 genes might have mapped twice during the matching process
exactly mean?
As far as I can see from the output of the bash join command, all the genes matches multiple times, because, clearly, each of them has several associated lines in the table downloaded from genome.ucsc, as you also pointed out.
In the following, a picture showing the running of the script I mentioned before.
Do you have the list of the gene names (even ensembl) that you finally considered in your experiments?
Still hoping to get your support in retrieving the names of the reference genes and their TSS.
Cheers, noired
Hello noired,
Looking back into my notes, we also used the "gene info" file from here.
To make things easier, I am sharing the final gene file we used (with locations TSS+/-5000bp positions). GeneFile.txt
I will add it to the github repo as well. Please let me know if you still have questions.
Regards Ritambhara
Dear Rithambara.
Please, do not add the GeneFile.txt file to the github repo. The locations are not computed as TSS+/-5000bp, as their differences equal 20k, not 10k as expected. As an example, take the first line:
chrX 99884988 99904988 ENSG00000000003_TSPAN6
it is easy to see how the difference stop-start is 20k: 99904988 - 99884988 = 20000.
Are you really sure this is the final gene file that was used?
Am I missing some important fact...?
noired
Hello noired,
That is not an issue at all, this file is TSS+/-10,000 bp instead of 5kb, I apologize for the confusion. I will update the file, meanwhile, you can add/subtract 5kbp to obtain the original file. Everything else is correct.
Regards Ritambhara
Dear Rithambara, I would like to finally ask you one more question. How did you perform the fair train-val-test split? Which third of the dataset was specifically assigned to training, which third to validation and which third to testing? Did you proceed randomly? Did you use a particular seed?
noired
Hello noired,
The split was straightforward using bash split command with lines 6601,6601, and 6600 respectively. Top 6601 were training, middle 6601 were validation and last 6600 were test samples.
Regards Ritambhara
Dear Ritambhara, by reading your paper, I cannot get which criterion did you use to get the 19802 gene-samples constituting your dataset before train-test-val split. There is no reference to any annotation file and the procedure by which you associated those gene with the REMC expression quantifications is even darker to me.
In the other issue reported here you suggested that gene TSSs were retrieved by the table that can be constructed at this address. Could you please be more precise? the table got by setting:
clade: Mammal genome: Human assembly: Feb 2009 (GRCh37/hg19) group: Genes and Gene Predictions track: UCSC Genes (???) table: knownGene region: genome
has 82,960 rows, clearly much more than the genes you investigated.
As far as the gene expression quantifications are concerned, REMC says here that those were built by considering gencode v10 annotation file. The related files can be found here, but you were not clear about which of those files was used. It's unlikely that you used the file "57epigenomes.RPKM.pc.gz" because it contains less genes than those you considered (19795 vs 19802). Moreover, supposing you used the table retrieved as described above to select reference genes and their TSS, how did you translate the ucsc IDs into ensembl IDs in order to consider the right expression line for each gene?
Thank you for the support, I hope you will kindly help me in replicating your experiments.
Cheers, noired