ag-computational-bio / bakrep-web

The user interface for bakrep

Add the metadata to the website #51

Closed lukasjelonek closed 10 months ago

lukasjelonek commented 11 months ago

Each dataset should contain a metadata file. Parts of the metadata should be searchable. The metadata should be visible on the dataset page.

lukasjelonek commented 11 months ago

This will require changes in three places: the upload client, the bakrep server, and the bakrep website.

lukasjelonek commented 11 months ago

Update strategy

As the Elasticsearch index must be rebuilt, I wonder how this can be done.

At the moment I see a few options:

  1. New index upload
    • Create a second server with another URL and the new Elasticsearch index configuration.
    • Configure the uploader to upload the index data to the new index.
    • Once everything is ready, reconfigure the production deployment to use the new index.
    • Delete the old index.
  2. Let Elasticsearch create a new index
    • Upload all index changes to the existing server.
    • Trigger an Elasticsearch reindex into a new index.
    • Once reindexing is done, reconfigure the server to use the new index.
    • Delete the old index.
  3. Batch upload to a new index
    • Create a new Elasticsearch index with the new mapping.
    • Compute all index files for all datasets.
    • Batch-upload all index files to the new index.
    • Configure the server to use the new index.
    • Delete the old index.

As new data has to be uploaded to the S3 bucket anyway, I prefer option two. It won't require changes to the server code until all data is available. Then all that has to be done is to upload the new mapping, wait for the reindexing to finish, and finally switch the server to the new index.
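
For reference, the reindex step of option two is a single call to the Elasticsearch _reindex API. A minimal sketch, assuming placeholder index names bakrep-v1 and bakrep-v2:

# Copy all documents from the old index into the new index with the new mapping.
curl -X POST "localhost:9200/_reindex" -H 'Content-Type: application/json' -d '
{
  "source": { "index": "bakrep-v1" },
  "dest": { "index": "bakrep-v2" }
}'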

Some research showed that it should be possible to add new fields to the mapping without creating a new index: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-put-mapping.html. If it works as expected, there is a fourth option: upload everything and then update the mapping, or update the mapping and then upload everything. I prefer this option.

lukasjelonek commented 11 months ago

I evaluated whether it is possible to add fields to the mapping without reindexing. It works for fields that are not yet part of the mapping. When the mapping is dynamic and documents containing the new field already exist, a mapping for that field will already have been created dynamically, and reindexing will be required.

For bakrep we use a dynamic mapping with some preconfigured fields. So before adding new documents, the mapping should be updated to include all new fields.
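
Such a mapping update is a single call to the put-mapping API linked above. A minimal sketch, assuming the index is called bakrep; the field types are illustrative, not the final bakrep mapping:

# Add the new metadata fields to the existing index before uploading documents.
curl -X PUT "localhost:9200/bakrep/_mapping" -H 'Content-Type: application/json' -d '
{
  "properties": {
    "collection_date": { "type": "date" },
    "instrument_platform": { "type": "keyword" },
    "environmental_sample": { "type": "boolean" }
  }
}'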

lukasjelonek commented 11 months ago

The currently available metadata files contain only strings. This may cause problems for data processing.

Example:

{
  "SAMD00000550": {
    "study_accession": "PRJDB1732",
    "run_accession": "DRR041181",
    "project_name": "Corynebacterium glutamicum CS176 strain genome sequencing project",
    "isolation_source": "",
    "instrument_platform": "ILLUMINA",
    "host": "",
    "first_public": "2017-02-04",
    "country": "Thailand: Bangkok, Suwanvajokkasikit Farm, Kasetsart University",
    "collection_date": "2003-10-08",
    "center_name": "YAMAGU_U",
    "accession": "SAMD00000550",
    "bio_material": "",
    "broker_name": "",
    "collected_by": "",
    "culture_collection": "",
    "depth": "",
    "environment_biome": "Chicken farm",
    "environment_feature": "land, farm",
    "environment_material": "soil contaminated chicken manure",
    "environmental_package": "",
    "environmental_sample": "False",
    "host_sex": "",
    "host_status": "",
    "host_tax_id": "NA",
    "instrument_model": "Illumina HiSeq 2000",
    "isolate": "",
    "lat": "14.24",
    "location": "14.24 N 99.51 E",
    "lon": "99.51",
    "sample_alias": "DRS040181",
    "secondary_sample_accession": "DRS040181",
    "secondary_study_accession": "DRP003440",
    "serotype": "",
    "serovar": "",
    "strain": "CS176",
    "study_alias": "DRP003440",
    "study_title": "Corynebacterium glutamicum CS176 strain genome sequencing project",
    "sub_strain": "",
    "submission_accession": "DRA003797"
  }
}

This would be better as:

{
  "id": "SAMD00000550",
  "run_accession": "DRR041181",
  "project_name": "Corynebacterium glutamicum CS176 strain genome sequencing project",
  "isolation_source": null,
  "instrument_platform": "ILLUMINA",
  "host": null,
  "first_public": "2017-02-04",
  "country": "Thailand: Bangkok, Suwanvajokkasikit Farm, Kasetsart University",
  "collection_date": "2003-10-08",
  "center_name": "YAMAGU_U",
  "accession": "SAMD00000550",
  "bio_material": null,
  "broker_name": null,
  "collected_by": null,
  "culture_collection": null,
  "depth": null,
  "environment": {
    "biome": "Chicken farm",
    "feature": "land, farm",
    "material": "soil contaminated chicken manure",
  },
  "environmental_package": null,
  "environmental_sample": false,
  "host": {
    "sex": null,
    "status": null,
    "tax_id": null,
  },
  "instrument_model": "Illumina HiSeq 2000",
  "isolate": null,
  "location": {
    "lon": 99.51,
    "lat": 14.24,
  },
  "sample_alias": "DRS040181",
  "secondary_sample_accession": "DRS040181",
  "secondary_study_accession": "DRP003440",
  "serotype": null,
  "serovar": null,
  "strain": "CS176",
  "study": {
    "accession": "PRJDB1732",
    "alias": "DRP003440",
    "title": "Corynebacterium glutamicum CS176 strain genome sequencing project",
  },
  "sub_strain": null,
  "submission_accession": "DRA003797"
}

It would also be an option to omit null-valued fields entirely:

{
  "id": "SAMD00000550",
  "run_accession": "DRR041181",
  "project_name": "Corynebacterium glutamicum CS176 strain genome sequencing project",
  "instrument_platform": "ILLUMINA",
  "first_public": "2017-02-04",
  "country": "Thailand: Bangkok, Suwanvajokkasikit Farm, Kasetsart University",
  "collection_date": "2003-10-08",
  "center_name": "YAMAGU_U",
  "accession": "SAMD00000550",
  "environment": {
    "biome": "Chicken farm",
    "feature": "land, farm",
    "material": "soil contaminated chicken manure",
  },
  "environmental_sample": false,
  "instrument_model": "Illumina HiSeq 2000",
  "location": {
    "lon": 99.51,
    "lat": 14.24,
  },
  "sample_alias": "DRS040181",
  "secondary_sample_accession": "DRS040181",
  "secondary_study_accession": "DRP003440",
  "strain": "CS176",
  "study": {
    "accession": "PRJDB1732",
    "alias": "DRP003440",
    "title": "Corynebacterium glutamicum CS176 strain genome sequencing project",
  },
  "submission_accession": "DRA003797"
}

lukasjelonek commented 11 months ago

The location can be stored as GeoJSON (note that GeoJSON coordinates are in [longitude, latitude] order):

{
  "type": "Point",
  "coordinates": [99.51, 14.24]
}
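
This also fits Elasticsearch well: recent versions accept GeoJSON Point input for fields mapped as geo_point, which enables geo queries on the sample locations. A minimal mapping sketch, assuming the index is called bakrep:

# Map the location field as geo_point so GeoJSON points can be indexed and queried.
curl -X PUT "localhost:9200/bakrep/_mapping" -H 'Content-Type: application/json' -d '
{
  "properties": {
    "location": { "type": "geo_point" }
  }
}'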

lukasjelonek commented 11 months ago

After offline discussion we decided to use this format:

{
  "id": "SAMD00000550",
  "study_accession": "PRJDB1732",
  "run_accession": "DRR041181",
  "project_name": "Corynebacterium glutamicum CS176 strain genome sequencing project",
  "isolation_source": null,
  "instrument_platform": "ILLUMINA",
  "host": null,
  "first_public": "2017-02-04",
  "country": "Thailand: Bangkok, Suwanvajokkasikit Farm, Kasetsart University",
  "collection_date": "2003-10-08",
  "center_name": "YAMAGU_U",
  "accession": "SAMD00000550",
  "bio_material": null,
  "broker_name": null,
  "collected_by": null,
  "culture_collection": null,
  "depth": null,
  "environment_biome": "Chicken farm",
  "environment_feature": "land, farm",
  "environment_material": "soil contaminated chicken manure",
  "environmental_package": null,
  "environmental_sample": false,
  "host_sex": null,
  "host_status": null,
  "host_tax_id": null,
  "instrument_model": "Illumina HiSeq 2000",
  "isolate": null,
  "location": {
      "type": "Point",
      "coordinates": [99.51, 14.24]
   },
  "sample_alias": "DRS040181",
  "secondary_sample_accession": "DRS040181",
  "secondary_study_accession": "DRP003440",
  "serotype": null,
  "serovar": null,
  "strain": "CS176",
  "study_alias": "DRP003440",
  "study_title": "Corynebacterium glutamicum CS176 strain genome sequencing project",
  "sub_strain": null,
  "submission_accession": "DRA003797"
}
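
Converting the existing strings-only files into this format could look roughly like the following jq sketch (the input file name is hypothetical, and field-specific cleanups such as turning "NA" into null are omitted):

# Flatten the top-level id key, turn empty strings into null, parse the
# boolean, and build a GeoJSON point from the lat/lon strings.
jq 'to_entries[0] as $e
    | ($e.value | with_entries(.value |= (if . == "" then null else . end)))
    | . + { id: $e.key,
            environmental_sample: (.environmental_sample == "True"),
            location: (if .lat and .lon
                       then { type: "Point",
                              coordinates: [(.lon | tonumber), (.lat | tonumber)] }
                       else null end) }
    | del(.lat, .lon)' SAMD00000550.metadata.json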

lukasjelonek commented 11 months ago

Changes to the upload client

I propose a new command, update, that can apply one or multiple changes to a single dataset. Each of the operations listed below should map to a command-line parameter:

Operations

Add a file / replace an existing file

bakrep update --id xy --add-file path:attr1=xy,attr2=jj

Add a URL / replace an existing URL

bakrep update --id xy --add-external-url url.json

where url.json looks like:
{
  "url": "http://example.com/myfile.fna.gz",
  "md5": "abcd",
  "size": 123,
  "attributes": {
    "type": "assembly",
    "filetype": "fna"
  }
}

Delete entry

bakrep update --id xy --remove-entries attr1=xy,attr2=jj

Update index

bakrep update --id xy --update-index newindex.json

How to update the metadata and include ENA links with this proposal


# Obtain the assembly links for all datasets (external script)
# Compute the md5-sum and sizes for all datasets (external script)
# For each dataset create an external url json file with annotation: `type=assembly,filetype=fa` (external script)
# For each dataset generate a new index, including the metadata
bakrep index --id xy -f [... all files that have data for the index ...] -o xy.index.json

# For each dataset
bakrep update --id xy --add-file xy.metadata.json:type=metadata,filetype=json --add-external-url xy.assembly.json --update-index xy.index.json
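
Applied across the whole collection, this could be driven by a simple shell loop (a sketch; ids.txt is a hypothetical file with one dataset id per line):

# Run the proposed update command for every dataset id.
while read -r id; do
  bakrep update --id "$id" \
    --add-file "$id.metadata.json:type=metadata,filetype=json" \
    --add-external-url "$id.assembly.json" \
    --update-index "$id.index.json"
done < ids.txt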

Another option would be to apply the file and URL changes with the update command and to update the indexes with the batch index-update script, as this may be faster.

lukasjelonek commented 10 months ago

At the moment, result deletion only works by file name, which is sufficient for now. If deletion by attribute sets is required later, it will be implemented then.

lukasjelonek commented 10 months ago

The index refresh is disabled during the batch update.
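
For reference, this is typically done via the Elasticsearch index settings API (a sketch; the index name and the restored interval are placeholders):

# Turn off the refresh for the duration of the batch upload ...
curl -X PUT "localhost:9200/bakrep/_settings" -H 'Content-Type: application/json' -d '
{ "index": { "refresh_interval": "-1" } }'

# ... and restore the default interval afterwards.
curl -X PUT "localhost:9200/bakrep/_settings" -H 'Content-Type: application/json' -d '
{ "index": { "refresh_interval": "1s" } }'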