ag-computational-bio / bakrep-web

The user interface for bakrep

Add the metadata to the website #51

Closed lukasjelonek closed 10 months ago

lukasjelonek commented 11 months ago

Each dataset should contain a metadata file. Parts of the metadata should be searchable. The metadata should be visible on the dataset page.

lukasjelonek commented 11 months ago

This will require changes in three places: the upload client, the bakrep server, and the bakrep website.

lukasjelonek commented 11 months ago

Update strategy

As the Elasticsearch index must be rebuilt, I wonder how this can be done.

At the moment I see a few options:

  1. New index upload
    • Create a second server with another URL and the new Elasticsearch index configuration.
    • Configure the uploader to upload the index data to the new index.
    • Once everything is ready, reconfigure the production deployment to use the new index.
    • Delete the old index.
  2. Let Elasticsearch create a new index
    • Upload all index changes to the existing server.
    • Trigger an Elasticsearch reindex into a new index.
    • Once reindexing is done, reconfigure the server to use the new index.
    • Delete the old index.
  3. Batch upload to a new index
    • Create a new Elasticsearch index with the new mapping.
    • Compute all index files for all datasets.
    • Batch-upload all index files to the new index.
    • Configure the server to use the new index.
    • Delete the old index.

As new data has to be uploaded to the S3 bucket anyway, I prefer option two. It won't require changes to the server code until all data is available. Then all that has to be done is to upload the new mapping, wait for the reindexing to finish, and finally switch the server to the new index.
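
For reference, the reindex step of option two is a single call to the Elasticsearch _reindex API. A minimal sketch, assuming placeholder index names bakrep-v1 and bakrep-v2:

# Copy all documents from the old index into the new index with the new mapping.
curl -X POST "localhost:9200/_reindex" -H 'Content-Type: application/json' -d '
{
  "source": { "index": "bakrep-v1" },
  "dest": { "index": "bakrep-v2" }
}'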

Some research showed that it should be possible to add new fields to the mapping without creating a new index: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-put-mapping.html. If it works as expected, there is a fourth option: upload everything and then update the mapping, or update the mapping and then upload everything. I prefer this option.

lukasjelonek commented 11 months ago

I evaluated whether it is possible to add fields to the mapping without reindexing. It works for fields that are not yet part of the mapping. When the mapping is dynamic and documents containing the new field already exist, a mapping for that field will already have been created dynamically, and reindexing will be required.

For bakrep we use a dynamic mapping with some preconfigured fields. So before adding new documents, the mapping should be updated to include all new fields.
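
Such a mapping update is a single call to the put-mapping API linked above. A minimal sketch, assuming the index is called bakrep; the field types are illustrative, not the final bakrep mapping:

# Add the new metadata fields to the existing index before uploading documents.
curl -X PUT "localhost:9200/bakrep/_mapping" -H 'Content-Type: application/json' -d '
{
  "properties": {
    "collection_date": { "type": "date" },
    "instrument_platform": { "type": "keyword" },
    "environmental_sample": { "type": "boolean" }
  }
}'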

lukasjelonek commented 11 months ago

The currently available metadata files contain only strings. This may cause problems for data processing.

Example:

{
  "SAMD00000550": {
    "study_accession": "PRJDB1732",
    "run_accession": "DRR041181",
    "project_name": "Corynebacterium glutamicum CS176 strain genome sequencing project",
    "isolation_source": "",
    "instrument_platform": "ILLUMINA",
    "host": "",
    "first_public": "2017-02-04",
    "country": "Thailand: Bangkok, Suwanvajokkasikit Farm, Kasetsart University",
    "collection_date": "2003-10-08",
    "center_name": "YAMAGU_U",
    "accession": "SAMD00000550",
    "bio_material": "",
    "broker_name": "",
    "collected_by": "",
    "culture_collection": "",
    "depth": "",
    "environment_biome": "Chicken farm",
    "environment_feature": "land, farm",
    "environment_material": "soil contaminated chicken manure",
    "environmental_package": "",
    "environmental_sample": "False",
    "host_sex": "",
    "host_status": "",
    "host_tax_id": "NA",
    "instrument_model": "Illumina HiSeq 2000",
    "isolate": "",
    "lat": "14.24",
    "location": "14.24 N 99.51 E",
    "lon": "99.51",
    "sample_alias": "DRS040181",
    "secondary_sample_accession": "DRS040181",
    "secondary_study_accession": "DRP003440",
    "serotype": "",
    "serovar": "",
    "strain": "CS176",
    "study_alias": "DRP003440",
    "study_title": "Corynebacterium glutamicum CS176 strain genome sequencing project",
    "sub_strain": "",
    "submission_accession": "DRA003797"
  }
}

This would be better as:

{
  "id": "SAMD00000550",
  "run_accession": "DRR041181",
  "project_name": "Corynebacterium glutamicum CS176 strain genome sequencing project",
  "isolation_source": null,
  "instrument_platform": "ILLUMINA",
  "host": null,
  "first_public": "2017-02-04",
  "country": "Thailand: Bangkok, Suwanvajokkasikit Farm, Kasetsart University",
  "collection_date": "2003-10-08",
  "center_name": "YAMAGU_U",
  "accession": "SAMD00000550",
  "bio_material": null,
  "broker_name": null,
  "collected_by": null,
  "culture_collection": null,
  "depth": null,
  "environment": {
    "biome": "Chicken farm",
    "feature": "land, farm",
    "material": "soil contaminated chicken manure",
  },
  "environmental_package": null,
  "environmental_sample": false,
  "host": {
    "sex": null,
    "status": null,
    "tax_id": null,
  },
  "instrument_model": "Illumina HiSeq 2000",
  "isolate": null,
  "location": {
    "lon": 99.51,
    "lat": 14.24,
  },
  "sample_alias": "DRS040181",
  "secondary_sample_accession": "DRS040181",
  "secondary_study_accession": "DRP003440",
  "serotype": null,
  "serovar": null,
  "strain": "CS176",
  "study": {
    "accession": "PRJDB1732",
    "alias": "DRP003440",
    "title": "Corynebacterium glutamicum CS176 strain genome sequencing project",
  },
  "sub_strain": null,
  "submission_accession": "DRA003797"
}

It would also be an option to omit null-valued fields entirely:

{
  "id": "SAMD00000550",
  "run_accession": "DRR041181",
  "project_name": "Corynebacterium glutamicum CS176 strain genome sequencing project",
  "instrument_platform": "ILLUMINA",
  "first_public": "2017-02-04",
  "country": "Thailand: Bangkok, Suwanvajokkasikit Farm, Kasetsart University",
  "collection_date": "2003-10-08",
  "center_name": "YAMAGU_U",
  "accession": "SAMD00000550",
  "environment": {
    "biome": "Chicken farm",
    "feature": "land, farm",
    "material": "soil contaminated chicken manure",
  },
  "environmental_sample": false,
  "instrument_model": "Illumina HiSeq 2000",
  "location": {
    "lon": 99.51,
    "lat": 14.24,
  },
  "sample_alias": "DRS040181",
  "secondary_sample_accession": "DRS040181",
  "secondary_study_accession": "DRP003440",
  "strain": "CS176",
  "study": {
    "accession": "PRJDB1732",
    "alias": "DRP003440",
    "title": "Corynebacterium glutamicum CS176 strain genome sequencing project",
  },
  "submission_accession": "DRA003797"
}

lukasjelonek commented 11 months ago

The location can be stored as GeoJSON (note that GeoJSON coordinates are in [longitude, latitude] order):

{
  "type": "Point",
  "coordinates": [99.51, 14.24]
}
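
This also fits Elasticsearch well: recent versions accept GeoJSON Point input for fields mapped as geo_point, which enables geo queries on the sample locations. A minimal mapping sketch, assuming the index is called bakrep:

# Map the location field as geo_point so GeoJSON points can be indexed and queried.
curl -X PUT "localhost:9200/bakrep/_mapping" -H 'Content-Type: application/json' -d '
{
  "properties": {
    "location": { "type": "geo_point" }
  }
}'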

lukasjelonek commented 11 months ago

After offline discussion we decided to use this format:

{
  "id": "SAMD00000550",
  "study_accession": "PRJDB1732",
  "run_accession": "DRR041181",
  "project_name": "Corynebacterium glutamicum CS176 strain genome sequencing project",
  "isolation_source": null,
  "instrument_platform": "ILLUMINA",
  "host": null,
  "first_public": "2017-02-04",
  "country": "Thailand: Bangkok, Suwanvajokkasikit Farm, Kasetsart University",
  "collection_date": "2003-10-08",
  "center_name": "YAMAGU_U",
  "accession": "SAMD00000550",
  "bio_material": null,
  "broker_name": null,
  "collected_by": null,
  "culture_collection": null,
  "depth": null,
  "environment_biome": "Chicken farm",
  "environment_feature": "land, farm",
  "environment_material": "soil contaminated chicken manure",
  "environmental_package": null,
  "environmental_sample": false,
  "host_sex": null,
  "host_status": null,
  "host_tax_id": null,
  "instrument_model": "Illumina HiSeq 2000",
  "isolate": null,
  "location": {
      "type": "Point",
      "coordinates": [99.51, 14.24]
   },
  "sample_alias": "DRS040181",
  "secondary_sample_accession": "DRS040181",
  "secondary_study_accession": "DRP003440",
  "serotype": null,
  "serovar": null,
  "strain": "CS176",
  "study_alias": "DRP003440",
  "study_title": "Corynebacterium glutamicum CS176 strain genome sequencing project",
  "sub_strain": null,
  "submission_accession": "DRA003797"
}
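
Converting the existing strings-only files into this format could look roughly like the following jq sketch (the input file name is hypothetical, and field-specific cleanups such as turning "NA" into null are omitted):

# Flatten the top-level id key, turn empty strings into null, parse the
# boolean, and build a GeoJSON point from the lat/lon strings.
jq 'to_entries[0] as $e
    | ($e.value | with_entries(.value |= (if . == "" then null else . end)))
    | . + { id: $e.key,
            environmental_sample: (.environmental_sample == "True"),
            location: (if .lat and .lon
                       then { type: "Point",
                              coordinates: [(.lon | tonumber), (.lat | tonumber)] }
                       else null end) }
    | del(.lat, .lon)' SAMD00000550.metadata.json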

lukasjelonek commented 11 months ago

Changes to the upload client

I propose a new command, update, that can apply one or multiple changes to a single dataset. Each of the operations listed below should map to a command-line parameter:

Operations

Add a file / replace an existing file

bakrep update --id xy --add-file path:attr1=xy,attr2=jj

Add a URL / replace an existing URL

bakrep update --id xy --add-external-url url.json

where url.json looks like:
{
  "url": "http://example.com/myfile.fna.gz",
  "md5": "abcd",
  "size": 123,
  "attributes": {
    "type": "assembly",
    "filetype": "fna"
  }
}

Delete entry

bakrep update --id xy --remove-entries attr1=xy,attr2=jj

Update index

bakrep update --id xy --update-index newindex.json

How to update the metadata and include ENA links with this proposal


# Obtain the assembly links for all datasets (external script)
# Compute the md5-sum and sizes for all datasets (external script)
# For each dataset create an external url json file with annotation: `type=assembly,filetype=fa` (external script)
# For each dataset generate a new index, including the metadata
bakrep index --id xy -f [... all files that have data for the index ...] -o xy.index.json

# For each dataset
bakrep update --id xy --add-file xy.metadata.json:type=metadata,filetype=json --add-external-url xy.assembly.json --update-index xy.index.json
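
Applied across the whole collection, this could be driven by a simple shell loop (a sketch; ids.txt is a hypothetical file with one dataset id per line):

# Run the proposed update command for every dataset id.
while read -r id; do
  bakrep update --id "$id" \
    --add-file "$id.metadata.json:type=metadata,filetype=json" \
    --add-external-url "$id.assembly.json" \
    --update-index "$id.index.json"
done < ids.txt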

Another option would be to apply the file and URL changes with the update command and to update the indexes with the batch index-update script, as this may be faster.

lukasjelonek commented 10 months ago

At the moment, result deletion only works by file name, which is sufficient for now. If deletion by attribute sets is required later, it will be implemented then.

lukasjelonek commented 10 months ago

The index refresh is disabled during the batch update.
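
For reference, this is typically done via the Elasticsearch index settings API (a sketch; the index name and the restored interval are placeholders):

# Turn off the refresh for the duration of the batch upload ...
curl -X PUT "localhost:9200/bakrep/_settings" -H 'Content-Type: application/json' -d '
{ "index": { "refresh_interval": "-1" } }'

# ... and restore the default interval afterwards.
curl -X PUT "localhost:9200/bakrep/_settings" -H 'Content-Type: application/json' -d '
{ "index": { "refresh_interval": "1s" } }'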