Closed: lukasjelonek closed this issue 10 months ago
This will require the following updates to the upload client:
This will require the following changes to the bakrep server:
This will require the following changes to the bakrep website:
[x] Add metadata to dataset ~summary~ page
This will require the following manual steps:
[x] Upload all metadata.json files
[x] Review metadata.json file content
[x] ~Trigger index rebuild and reconfigure server to use new index~ -> Not needed as it was possible to update the mapping without recomputation
[x] Update search index with metadata (file upload and indexing have been separated, so this is an extra point)
As the Elasticsearch index must be rebuilt, I wonder how this can be done.
At the moment I see a few options:
As new data has to be uploaded to the S3 bucket, I prefer option two. It won't require changes to the server code until all data is available. Then all that has to be done is to upload the new mapping, wait for a while, and finally switch the server to the new index.
Some research showed that it should be possible to add new fields to the mapping without creating a new index: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-put-mapping.html. If it works as expected, there is a fourth option: upload everything and then update the mapping, or update the mapping and then upload everything. I prefer this option.
I evaluated whether it is possible to add fields to the mapping without reindexing. It works for new fields that are not yet part of the mapping. When the mapping is dynamic and documents with the new field already exist, a mapping for the field will already have been created and a reindexing will be required.
For bakrep we use a dynamic mapping with some preconfigured fields. So before adding new documents, the mapping should be updated to include all new fields.
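A sketch of such a mapping update via the put-mapping API. The index name `bakrep` and the concrete field types are assumptions for illustration, not taken from the actual deployment:

```
PUT /bakrep/_mapping
{
  "properties": {
    "country":              { "type": "keyword" },
    "collection_date":      { "type": "date" },
    "environmental_sample": { "type": "boolean" },
    "instrument_platform":  { "type": "keyword" }
  }
}
```

Existing documents are untouched by this request; the added fields simply become available for documents indexed afterwards.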
The currently available metadata files contain only string values. This may cause problems during data processing.
Example:
{
"SAMD00000550": {
"study_accession": "PRJDB1732",
"run_accession": "DRR041181",
"project_name": "Corynebacterium glutamicum CS176 strain genome sequencing project",
"isolation_source": "",
"instrument_platform": "ILLUMINA",
"host": "",
"first_public": "2017-02-04",
"country": "Thailand: Bangkok, Suwanvajokkasikit Farm, Kasetsart University",
"collection_date": "2003-10-08",
"center_name": "YAMAGU_U",
"accession": "SAMD00000550",
"bio_material": "",
"broker_name": "",
"collected_by": "",
"culture_collection": "",
"depth": "",
"environment_biome": "Chicken farm",
"environment_feature": "land, farm",
"environment_material": "soil contaminated chicken manure",
"environmental_package": "",
"environmental_sample": "False",
"host_sex": "",
"host_status": "",
"host_tax_id": "NA",
"instrument_model": "Illumina HiSeq 2000",
"isolate": "",
"lat": "14.24",
"location": "14.24 N 99.51 E",
"lon": "99.51",
"sample_alias": "DRS040181",
"secondary_sample_accession": "DRS040181",
"secondary_study_accession": "DRP003440",
"serotype": "",
"serovar": "",
"strain": "CS176",
"study_alias": "DRP003440",
"study_title": "Corynebacterium glutamicum CS176 strain genome sequencing project",
"sub_strain": "",
"submission_accession": "DRA003797"
}
}
Would be better as:
{
"id": "SAMD00000550",
"run_accession": "DRR041181",
"project_name": "Corynebacterium glutamicum CS176 strain genome sequencing project",
"isolation_source": null,
"instrument_platform": "ILLUMINA",
"host": null,
"first_public": "2017-02-04",
"country": "Thailand: Bangkok, Suwanvajokkasikit Farm, Kasetsart University",
"collection_date": "2003-10-08",
"center_name": "YAMAGU_U",
"accession": "SAMD00000550",
"bio_material": null,
"broker_name": null,
"collected_by": null,
"culture_collection": null,
"depth": null,
"environment": {
"biome": "Chicken farm",
"feature": "land, farm",
"material": "soil contaminated chicken manure",
},
"environmental_package": null,
"environmental_sample": false,
"host": {
"sex": null,
"status": null,
"tax_id": null,
},
"instrument_model": "Illumina HiSeq 2000",
"isolate": null,
"location": {
"lon": 99.51,
"lat": 14.24,
},
"sample_alias": "DRS040181",
"secondary_sample_accession": "DRS040181",
"secondary_study_accession": "DRP003440",
"serotype": null,
"serovar": null,
"strain": "CS176",
"study": {
"accession": "PRJDB1732",
"alias": "DRP003440",
"title": "Corynebacterium glutamicum CS176 strain genome sequencing project",
},
"sub_strain": null,
"submission_accession": "DRA003797"
}
It would also be an option to omit fields whose value is null:
{
"id": "SAMD00000550",
"run_accession": "DRR041181",
"project_name": "Corynebacterium glutamicum CS176 strain genome sequencing project",
"instrument_platform": "ILLUMINA",
"first_public": "2017-02-04",
"country": "Thailand: Bangkok, Suwanvajokkasikit Farm, Kasetsart University",
"collection_date": "2003-10-08",
"center_name": "YAMAGU_U",
"accession": "SAMD00000550",
"environment": {
"biome": "Chicken farm",
"feature": "land, farm",
"material": "soil contaminated chicken manure",
},
"environmental_sample": false,
"instrument_model": "Illumina HiSeq 2000",
"location": {
"lon": 99.51,
"lat": 14.24,
},
"sample_alias": "DRS040181",
"secondary_sample_accession": "DRS040181",
"secondary_study_accession": "DRP003440",
"strain": "CS176",
"study": {
"accession": "PRJDB1732",
"alias": "DRP003440",
"title": "Corynebacterium glutamicum CS176 strain genome sequencing project",
},
"submission_accession": "DRA003797"
}
The location can be stored as GeoJSON:
{
"type": "Point",
"coordinates": [99.51, 14.24]
}
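For searching, a GeoJSON point like this could be mapped as a `geo_shape` (or `geo_point`) field. A sketch, with the index name `bakrep` assumed:

```
PUT /bakrep/_mapping
{
  "properties": {
    "location": { "type": "geo_shape" }
  }
}
```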
After offline discussion we decided to use this format:
{
"id": "SAMD00000550",
"study_accession": "PRJDB1732",
"run_accession": "DRR041181",
"project_name": "Corynebacterium glutamicum CS176 strain genome sequencing project",
"isolation_source": null,
"instrument_platform": "ILLUMINA",
"host": null,
"first_public": "2017-02-04",
"country": "Thailand: Bangkok, Suwanvajokkasikit Farm, Kasetsart University",
"collection_date": "2003-10-08",
"center_name": "YAMAGU_U",
"accession": "SAMD00000550",
"bio_material": null,
"broker_name": null,
"collected_by": null,
"culture_collection": null,
"depth": null,
"environment_biome": "Chicken farm",
"environment_feature": "land, farm",
"environment_material": "soil contaminated chicken manure",
"environmental_package": null,
"environmental_sample": false,
"host_sex": null,
"host_status": null,
"host_tax_id": null,
"instrument_model": "Illumina HiSeq 2000",
"isolate": null,
"location": {
"type": "Point",
"coordinates": [99.51, 14.24]
},
"sample_alias": "DRS040181",
"secondary_sample_accession": "DRS040181",
"secondary_study_accession": "DRP003440",
"serotype": null,
"serovar": null,
"strain": "CS176",
"study_alias": "DRP003440",
"study_title": "Corynebacterium glutamicum CS176 strain genome sequencing project",
"sub_strain": null,
"submission_accession": "DRA003797"
}
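A minimal sketch of converting the raw string-only records into the agreed format. The function name and the set of special values (`""`, `"NA"`, `"True"`/`"False"`) are assumptions based on the examples above, not an existing implementation:

```python
def normalize(record_id, raw):
    """Convert a raw string-only metadata record into the agreed format:
    empty strings and "NA" become null, "True"/"False" become booleans,
    and lat/lon/location are merged into a GeoJSON point (lon first)."""
    out = {"id": record_id}
    for key, value in raw.items():
        if value in ("", "NA"):
            out[key] = None
        elif value in ("True", "False"):
            out[key] = value == "True"
        else:
            out[key] = value
    # Replace the three positional fields with a single GeoJSON point.
    lat = out.pop("lat", None)
    lon = out.pop("lon", None)
    out.pop("location", None)
    if lat is not None and lon is not None:
        out["location"] = {"type": "Point", "coordinates": [float(lon), float(lat)]}
    else:
        out["location"] = None
    return out
```

Running it on the example record above would turn `"environmental_sample": "False"` into `false`, `"host": ""` into `null`, and `"lat"`/`"lon"` into the GeoJSON location.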
I propose a new command `update` that can apply one or more changes to a single dataset:
Each of these operations should map to a parameter:
bakrep update --id xy --add-file path:attr1=xy,attr2=jj
bakrep update --id xy --add-external-url url.json
{
"url": "http://example.com/myfile.fna.gz",
"md5": "abcd",
"size": 123,
"attributes": {
"type": "assembly",
"filetype": "fna"
}
}
bakrep update --id xy --remove-entries attr1=xy,attr2=jj
bakrep update --id xy --update-index newindex.json
# Obtain the assembly links for all datasets (external script)
# Compute the md5-sum and sizes for all datasets (external script)
# For each dataset create an external url json file with annotation: `type=assembly,filetype=fa` (external script)
# For each dataset generate a new index, including the metadata
bakrep index --id xy -f [... all files that have data for the index ...] -o xy.index.json
# For each dataset
bakrep update --id xy --add-file xy.metadata.json:type=metadata,filetype=json --add-external-url xy.assembly.json --update-index xy.index.json
Another option would be to update the files and URLs with the update command and to update the indexes with the batch index-update script, as this may be faster.
Result deletion currently only works by file name, which is sufficient for now. If deletion by attribute sets is required at a later point, it will be implemented then.
The index refresh is disabled during the batch update.
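Disabling and re-enabling the refresh could look like this (index name assumed; `-1` disables refresh, `1s` is the Elasticsearch default):

```
PUT /bakrep/_settings
{ "index": { "refresh_interval": "-1" } }

// after the batch update, re-enable:
PUT /bakrep/_settings
{ "index": { "refresh_interval": "1s" } }
```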
Each dataset should contain a metadata file. Parts of the metadata should be searchable. The metadata should be visible on the dataset page.