EnsemblGSOC / compara-deep-learning

Using Deep Learning techniques to enhance orthology calls

Fetch all chromosomes and genes #1

Closed mateuspatricio closed 5 years ago

mateuspatricio commented 5 years ago

The goal is to download all the genes from all chromosomes; they will be used to build the local synteny matrix. The GTF flat files in the following directory should be used: ftp://ftp.ensembl.org/pub/release-96/gtf/homo_sapiens/
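As a rough illustration of the step described above, here is a minimal sketch that reads gene records from GTF-formatted text and sorts them by start position within each chromosome. It is not the repository's implementation. The sample lines, the gene IDs, and the function name `genes_by_chromosome` are made up for the example; only the nine tab-separated GTF columns and the `gene_id "..."` attribute are taken from the GTF format itself.

```python
import re
from collections import defaultdict

# Tiny made-up GTF excerpt (hypothetical gene IDs), standing in for a downloaded file.
SAMPLE_GTF = """\
#!genome-build EXAMPLE
1\thavana\tgene\t500\t900\t.\t+\t.\tgene_id "GENE_B"; gene_name "b";
1\thavana\ttranscript\t500\t900\t.\t+\t.\tgene_id "GENE_B"; transcript_id "T1";
1\thavana\tgene\t100\t300\t.\t-\t.\tgene_id "GENE_A"; gene_name "a";
2\thavana\tgene\t50\t80\t.\t+\t.\tgene_id "GENE_C"; gene_name "c";
"""

GENE_ID = re.compile(r'gene_id "([^"]+)"')


def genes_by_chromosome(lines):
    """Collect (start, end, strand, gene_id) per chromosome, sorted by start."""
    genes = defaultdict(list)
    for line in lines:
        # Skip blank lines and header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        # Keep only well-formed rows whose feature type (column 3) is "gene".
        if len(cols) != 9 or cols[2] != "gene":
            continue
        match = GENE_ID.search(cols[8])
        if match:
            genes[cols[0]].append(
                (int(cols[3]), int(cols[4]), cols[6], match.group(1))
            )
    # Tuples sort by their first element, i.e. by start coordinate.
    return {chrom: sorted(records) for chrom, records in genes.items()}


genes = genes_by_chromosome(SAMPLE_GTF.splitlines())
for chrom, records in genes.items():
    print(chrom, [gene_id for _, _, _, gene_id in records])
```

For a real gzip-compressed GTF file, the same function could be fed a handle opened with `gzip.open(path, "rt")`, since it only iterates over lines.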

HarshitGupta11 commented 5 years ago

Homologies are shared across different organisms, so isn't it better to just load the databases with all the genes sorted, and then query them for the data needed to create the matrices?

Saving everything on disk is not a problem, but holding it in memory (RAM) might be: I loaded only one file with all the cDNA sequences and it takes around 2 GB.

Even saving the complete databases might not be a good idea because of the number of organisms. It would be better to store just the gene IDs and a dictionary mapping them to their indexes.
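One way to picture the "gene IDs plus index dictionary" idea is the sketch below. It assumes the gene IDs are already sorted by position within each chromosome. The IDs, the `sorted_gene_ids` layout, and the `build_index` helper are hypothetical and only illustrate the data structure.

```python
# Hypothetical, already position-sorted gene IDs per chromosome.
sorted_gene_ids = {
    "1": ["GENE_A", "GENE_B", "GENE_C"],
    "2": ["GENE_D", "GENE_E"],
}


def build_index(sorted_ids):
    """Map each gene ID to (chromosome, position in that chromosome's list)."""
    return {
        gene_id: (chrom, i)
        for chrom, ids in sorted_ids.items()
        for i, gene_id in enumerate(ids)
    }


gene_index = build_index(sorted_gene_ids)

# Constant-time lookup of where a gene sits, without keeping sequences in RAM.
chrom, i = gene_index["GENE_B"]
print(chrom, i, sorted_gene_ids[chrom][i])
```

Only short ID strings and integers are kept in memory this way. Sequences or other per-gene data can stay on disk and be read when a specific gene is needed.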

mateuspatricio commented 5 years ago

I based this task on your ideas for the first steps. I guess it can be renamed to "fetch data" or something similar. To answer your question: I think that's the idea, no? To load all the genes in sorted order.

HarshitGupta11 commented 5 years ago

Yup, it will be easy to get the neighboring genes by position if they are already sorted.

HarshitGupta11 commented 5 years ago

I have implemented the process to download, read, and load all the data. The part that gets the neighboring genes on either side (-/+) of the current gene is also done.
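For readers unfamiliar with the neighbor lookup mentioned above, a minimal sketch could look like the following. It assumes a list of gene IDs for one chromosome that is already sorted by position. The function name `neighbors`, the window size `n`, and the IDs are hypothetical and are not taken from the repository.

```python
def neighbors(sorted_ids, gene_id, n):
    """Return (before, after): up to n gene IDs on each side of gene_id."""
    i = sorted_ids.index(gene_id)  # linear scan; a dict of indexes would avoid it
    before = sorted_ids[max(0, i - n):i]   # clamp at the chromosome start
    after = sorted_ids[i + 1:i + 1 + n]    # slicing stops at the chromosome end
    return before, after


chromosome_genes = ["G1", "G2", "G3", "G4", "G5"]
print(neighbors(chromosome_genes, "G3", 2))  # (['G1', 'G2'], ['G4', 'G5'])
print(neighbors(chromosome_genes, "G1", 2))  # ([], ['G2', 'G3'])
```

Genes near the ends of a chromosome get shorter neighbor lists here. A fixed-size synteny matrix would need some padding convention, which this sketch leaves open.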

HarshitGupta11 commented 5 years ago

This task has been completed, and all the respective files have been added to the repository.