- [x] Automated downloading scripts with metadata fetching
- [x] BEDParser that creates GenomeNodes for ENCODE data.
Step 1 notes:
The 1072 data files of interest are listed in this search result: https://www.encodeproject.org/search/?type=Annotation&encyclopedia_version=4&files.file_type=bed+bed3%2B&assembly=hg19&organism.scientific_name=Homo+sapiens&limit=all
The downloading scripts will use the ENCODE REST API to fetch the search results, then extract the metadata and download URL for each file.
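For reference, a rough sketch of how that query could look (requesting `format=json` and reading `@graph` follows ENCODE's documented JSON responses, but the exact embedding of the `files` records on each annotation is an assumption to verify against the live response):

```python
import requests

BASE = "https://www.encodeproject.org"
SEARCH = ("/search/?type=Annotation&encyclopedia_version=4"
          "&files.file_type=bed+bed3%2B&assembly=hg19"
          "&organism.scientific_name=Homo+sapiens&limit=all&format=json")

def list_annotations():
    """Yield (accession, description, file_ids) for every matching annotation."""
    resp = requests.get(BASE + SEARCH, headers={"Accept": "application/json"})
    resp.raise_for_status()
    for record in resp.json()["@graph"]:
        # 'files' is assumed to be a list of @id strings; each one can be fetched
        # the same way (BASE + file_id + "?format=json") to get the download href.
        yield record["accession"], record.get("description", ""), record.get("files", [])
```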
Step 2 notes:
Each file will be downloaded as a .bed file of roughly 90 MB. Parsing one file yields ~1.3M GenomeNodes and 1 InfoNode. A GenomeNode looks like this:
```python
{
    '_id': 'G_a23bdfd5c1f0efb4e9cac6cca3422ea1d3c457d7e7550c13833653b15d354df1',
    'assembly': 'GRCh37',
    'chromid': 1,
    'start': 10244,
    'end': 10357,
    'length': 114,
    'name': 'EH37E1055273',
    'source': ['ENCODE'],
    'type': 'Inactive',
    'info': {
        'accession': 'ENCSR720NXO',
        'biosample': 'large intestine',
        'score': '0',
        'strand': '.',
        'targets': [],
        'thickEnd': '10357',
        'thickStart': '10244'
    }
}
```
Note: 'info.targets' can contain zero, one, or more binding targets, e.g. ['CTCF', 'H3K4me3', 'H3K27ac']. This flexibility will make the /distinct_values API slow, so we might want to pre-define these strings in our frontend.
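To illustrate the parsing step, here is a rough sketch of turning one 9-column BED line into a GenomeNode of the shape above. The SHA-256 content hash for '_id', the naive chromid conversion, and the color-to-type mapping (taken from the color codes quoted at the bottom of this issue) are assumptions, not a description of the actual BEDParser:

```python
import hashlib

# Assumed mapping from the itemRgb BED column to the element type,
# based on the ENCODE color codes listed in the issue description.
COLOR_TO_TYPE = {
    "255,0,0": "Promoter-like",
    "255,205,0": "Enhancer-like",
    "0,176,240": "CTCF-only",
    "6,218,147": "DNase-only",
    "225,225,225": "Inactive",
    "140,140,140": "Unclassified",
}

def parse_bed_line(line, accession, biosample, targets):
    """Convert one tab-separated 9-column BED line into a GenomeNode dict."""
    chrom, start, end, name, score, strand, thick_start, thick_end, rgb = \
        line.rstrip("\n").split("\t")[:9]
    start, end = int(start), int(end)
    node = {
        "assembly": "GRCh37",
        "chromid": int(chrom.replace("chr", "")),  # the real parser presumably handles chrX/Y/M; this naive int() does not
        "start": start,
        "end": end,
        "length": end - start + 1,
        "name": name,
        "source": ["ENCODE"],
        "type": COLOR_TO_TYPE.get(rgb, "Unclassified"),
        "info": {
            "accession": accession,
            "biosample": biosample,
            "score": score,
            "strand": strand,
            "targets": targets,
            "thickStart": thick_start,
            "thickEnd": thick_end,
        },
    }
    # '_id' is assumed to be a content hash so re-parsing the same region is idempotent.
    digest = hashlib.sha256(repr(sorted(node.items())).encode()).hexdigest()
    node["_id"] = "G_" + digest
    return node
```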
The InfoNode looks like this:
```python
{
    '_id': 'I_ENCSR506ZUJ',
    'name': 'ENCSR506ZUJ',
    'source': ['ENCODE'],
    'type': 'ENCODE_accession',
    'info': {
        'accession': 'ENCSR506ZUJ',
        'assembly': 'GRCh37',
        'biosample': 'amnion',
        'description': '9-state high H3K27ac for amnion male fetal (16 weeks)',
        'filename': 'ENCFF593JKZ.bed',
        'sourceurl': 'https://www.encodeproject.org/files/ENCFF593JKZ/@@download/ENCFF593JKZ.bed.gz',
        'targets': ['H3K27ac']
    }
}
```
In my test, parsing each .bed file and uploading it to MongoDB increases disk usage by about 0.5 GB, so uploading all of them would require roughly 500 GB of additional disk space for the MongoDB container. Our Kubernetes cluster on Google Cloud is configured to claim 100 GB for each mongo container; I can increase that to 1000 GB if needed. On a local copy, it is uncommon to have that much storage on a PC. Based on this, the script will only parse and upload 10 files by default, with a very simple way of changing the limit (see the sketch below).
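Something like this minimal sketch for the upload step, where the hypothetical MAX_FILES constant is the single knob for the limit (the pymongo wiring, database, and collection names are assumptions):

```python
from pymongo import MongoClient

# Hypothetical default: only the first MAX_FILES files are uploaded;
# changing this one constant is the "very simple way" of raising the limit.
MAX_FILES = 10

def upload(parsed_files, mongo_uri="mongodb://localhost:27017", batch=100_000):
    """parsed_files: iterable of (genome_nodes, info_node) pairs, one per .bed file."""
    db = MongoClient(mongo_uri).genome_db          # assumed database name
    for n, (genome_nodes, info_node) in enumerate(parsed_files):
        if n >= MAX_FILES:
            break
        # Insert the ~1.3M GenomeNodes per file in batches to bound memory use.
        for i in range(0, len(genome_nodes), batch):
            db.GenomeNodes.insert_many(genome_nodes[i:i + batch])
        db.InfoNodes.insert_one(info_node)
```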
Finished with commit 232e7adaaa3926fb470a2b792d14811c307d18b2
Closing this issue, looks like we are on track to hit this goal 👍
The task is to parse the ENCODE Encyclopedia of DNA Elements and create GenomeNodes for them:
Download all data here: Data Matrix [Use BED format files, not bigBed]
Color codes specify the type of regulatory element:
- 255,0,0 = Red = Promoter-like
- 255,205,0 = Orange = Enhancer-like
- 0,176,240 = Blue = CTCF-only
- 6,218,147 = Green = DNase-only
- 225,225,225 = Light Gray = Inactive
- 140,140,140 = Dark Gray = Unclassified
NOTE: We will also want to create InfoNodes for each cell-type
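A per-cell-type InfoNode could mirror the per-accession InfoNode shape shown above; a rough sketch, in which the 'ENCODE_biosample' type string and the '_id' scheme are assumptions rather than settled conventions:

```python
def make_biosample_infonode(biosample, accessions):
    """Build one InfoNode per cell-type, listing the accessions that used it."""
    return {
        "_id": "I_" + biosample.replace(" ", "_"),   # assumed id scheme
        "name": biosample,
        "source": ["ENCODE"],
        "type": "ENCODE_biosample",                  # assumed type string
        "info": {
            "biosample": biosample,
            "accessions": sorted(accessions),
        },
    }
```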