HumanCellAtlas / metadata-schema

This repo is for the metadata schemas associated with the HCA
Apache License 2.0

Add `file_core.is_auxiliary` #579

Closed hannes-ucsc closed 5 years ago

hannes-ucsc commented 6 years ago

For which schema is a change/update being suggested?

I would like to request an update to the file_core.json schema.

What should the change/update be?

Add a field to express that a data file is "auxiliary" in the sense that it is not useful in isolation, but only in the presence of another file.

What new field(s) need to be changed/added?

Why is the change requested?

Analysis bundles will soon contain expression matrices in the form of a zarray store. These stores are essentially a collection of files, a main .zarray file and dozens of other files referenced in the .zarray file.

Each of these files will have to be described by metadata and linked to the analysis process that created them. That's the only way they can be placed together in a bundle. However, the data browser should only show a row for the .zarray file. Listing all files would clutter the UI significantly without gain in utility or expressiveness.
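A minimal sketch of how a consumer such as the data browser might use the proposed flag. It assumes `is_auxiliary` is an optional boolean inside `file_core` and that a missing flag means "not auxiliary"; neither is settled in this request. The file names and the `visible_files` helper are illustrative only.

```python
# Hypothetical file metadata. `is_auxiliary` is the field proposed in this
# issue, not part of any released schema.
files = [
    {"file_core": {"file_name": "xxx_qc.bam", "file_format": "bam"}},
    {"file_core": {"file_name": "xxx.zarr!.zattrs", "file_format": "matrix",
                   "is_auxiliary": False}},
    {"file_core": {"file_name": "xxx.zarr!expression_matrix!expression!0.0",
                   "file_format": "unknown", "is_auxiliary": True}},
]


def visible_files(files):
    """Return the names of files that should get their own row in a listing."""
    return [
        f["file_core"]["file_name"]
        for f in files
        # Assumption: a file without the flag is treated as standalone.
        if not f["file_core"].get("is_auxiliary", False)
    ]


print(visible_files(files))
```

Under these assumptions, auxiliary files stay in the bundle and remain linked to their process; they are only omitted from the listing.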

@tburdett @mckinsel @rexwangcc @dshiga @NoopDog

malloryfreeberg commented 6 years ago

@hannes-ucsc so I can understand the usage better: is the intention that each analysis bundle submitted by an analysis pipeline to ingest will contain, in addition to the regular analysis files, a collection of one main zarray file and other files, and that it is these files (the main zarray file and the others) that will need to be marked as auxiliary or not?

Will the "main zarray file and other files" be submitted using the analysis_file.json schema? If so, I propose putting the proposed is_auxiliary field in the analysis_file.json schema. Unless there is a reason that the field might ever be needed for other types of files like sequence data files, image files, etc.

hewgreen commented 6 years ago

I'd go even more granular Mallory.

I think imaging datasets have this same use case with some added complexity. A metadata field that essentially toggles indexing doesn't sound optimal. It would be preferable to enhance our file typing. I've previously advocated recording both file format (we have this) and a file content label as an enum (proposed new field for file_core). These aux zarr files are not unknown, and we should avoid overloading that term, but we should be able to tag them. As an example, we could say format is .zarray and file_label is aux_zarr. Then the datastore could be more specific about not indexing these file types. This mechanism would also be helpful elsewhere, in imaging and supplementary files.
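A rough sketch of the file-label idea, assuming `file_label` is an optional enum-valued field in `file_core`. Only the `aux_zarr` value is named above; the other enum member, the `NOT_INDEXED` set and the `should_index` helper are hypothetical.

```python
from enum import Enum


class FileLabel(Enum):
    # Only "aux_zarr" comes from the discussion; "matrix" is an illustrative guess.
    MATRIX = "matrix"
    AUX_ZARR = "aux_zarr"


# Labels that a datastore might decide not to index (an assumption).
NOT_INDEXED = {FileLabel.AUX_ZARR}


def should_index(file_core):
    """Decide whether a file is indexed, based on its content label."""
    label = file_core.get("file_label")
    if label is None:
        # Assumption: unlabelled files keep today's behaviour and are indexed.
        return True
    return FileLabel(label) not in NOT_INDEXED


print(should_index({"file_format": ".zarray", "file_label": "aux_zarr"}))
print(should_index({"file_format": "bam"}))
```

The indexing decision then lives in the consumer's policy (`NOT_INDEXED`) rather than in a boolean stored in the metadata, which seems to be the point of preferring a label over a toggle.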

malloryfreeberg commented 6 years ago

@rexwangcc @hannes-ucsc If we are to add this field to the analysis_file.json, how do you all want to handle current analysis files that get submitted in the secondary bundle? For example, the bam and bai and log files. Should they also have this field? What should it be set to? I'm not sure how the Browser currently decides which of those files to display. Looks like only the BAM secondary files are being shown? How is this encoded?

Although this ticket might be a good idea, we are currently in a metadata freeze except for bug fixes and metadata needed to ingest datasets. I would consider this a feature request but happy to discuss when this change might be able to be incorporated into the metadata release process.

rexwangcc commented 6 years ago

@malloryfreeberg I can leave my 2 cents here from Secondary-analysis perspective:

  1. Will the "main zarray file and other files" be submitted using the analysis_file.json schema?

I can show you what the analysis_process looks like now, with some sensitive info masked out:

{
  "analysis_run_type": "run",
  "describedBy": "http://schema.integration.data.humancellatlas.org/type/process/analysis/8.0.3/analysis_process",
  "input_bundles": [
    "xxx"
  ],
  "inputs": [
    {
      "parameter_name": "fastq1",
      "parameter_value": "gs://org-hca-dss-checkout-integration/bundles/xxx-bundle-fqid/xxx_1.fastq.gz"
    },
    {
      "parameter_name": "fastq2",
      "parameter_value": "gs://org-hca-dss-checkout-integration/bundles/xxx-bundle-fqid/xxx_2.fastq.gz"
    },
    {
      "parameter_name": "sample_name",
      "parameter_value": "xxx"
    },
    {
      "parameter_name": "output_name",
      "parameter_value": "xxx"
    },
    {
      "parameter_name": "gtf_file",
      "parameter_value": "gs://bucket-name/reference/GRCh38_Gencode/gencode.v27.primary_assembly.annotation.gtf"
    },
    {
      "parameter_name": "genome_ref_fasta",
      "parameter_value": "gs://bucket-name/reference/GRCh38_Gencode/GRCh38.primary_assembly.genome.fa"
    },
    {
      "parameter_name": "rrna_intervals",
      "parameter_value": "gs://bucket-name/reference/GRCh38_Gencode/gencode.v27.rRNA.interval_list"
    },
    {
      "parameter_name": "gene_ref_flat",
      "parameter_value": "gs://bucket-name/reference/GRCh38_Gencode/GRCh38_gencode.v27.refFlat.txt"
    },
    {
      "parameter_name": "hisat2_ref_index",
      "parameter_value": "gs://bucket-name/reference/HISAT2/genome_snp_tran.tar.gz"
    },
    {
      "parameter_name": "hisat2_ref_trans_name",
      "parameter_value": "gencode_v27_trans_rsem"
    },
    {
      "parameter_name": "rsem_ref_index",
      "parameter_value": "gs://bucket-name/reference/GRCh38_Gencode/gencode_v27_primary.tar"
    },
    {
      "parameter_name": "hisat2_ref_name",
      "parameter_value": "genome_snp_tran"
    },
    {
      "parameter_name": "hisat2_ref_trans_name",
      "parameter_value": "gencode_v27_trans_rsem"
    },
    {
      "parameter_name": "stranded",
      "parameter_value": "NONE"
    }
  ],
  "outputs": [
    {
      "describedBy": "http://schema.integration.data.humancellatlas.org/type/file/5.3.4/analysis_file",
      "file_core": {
        "file_format": "txt",
        "file_name": "xxx_qc.bait_bias_summary_metrics.txt"
      },
      "schema_type": "file"
    },
    {
      "describedBy": "http://schema.integration.data.humancellatlas.org/type/file/5.3.4/analysis_file",
      "file_core": {
        "file_format": "txt",
        "file_name": "xxx_qc.insert_size_metrics.txt"
      },
      "schema_type": "file"
    },
    {
      "describedBy": "http://schema.integration.data.humancellatlas.org/type/file/5.3.4/analysis_file",
      "file_core": {
        "file_format": "txt",
        "file_name": "xxx_qc.quality_by_cycle_metrics.txt"
      },
      "schema_type": "file"
    },
    {
      "describedBy": "http://schema.integration.data.humancellatlas.org/type/file/5.3.4/analysis_file",
      "file_core": {
        "file_format": "txt",
        "file_name": "xxx_qc.quality_distribution_metrics.txt"
      },
      "schema_type": "file"
    },
    {
      "describedBy": "http://schema.integration.data.humancellatlas.org/type/file/5.3.4/analysis_file",
      "file_core": {
        "file_format": "txt",
        "file_name": "xxx_qc.rna_metrics.txt"
      },
      "schema_type": "file"
    },
    {
      "describedBy": "http://schema.integration.data.humancellatlas.org/type/file/5.3.4/analysis_file",
      "file_core": {
        "file_format": "csv",
        "file_name": "xxx_QCs.csv"
      },
      "schema_type": "file"
    },
    {
      "describedBy": "http://schema.integration.data.humancellatlas.org/type/file/5.3.4/analysis_file",
      "file_core": {
        "file_format": "csv",
        "file_name": "xxx_bait_bias_detail_metrics.csv"
      },
      "schema_type": "file"
    },
    {
      "describedBy": "http://schema.integration.data.humancellatlas.org/type/file/5.3.4/analysis_file",
      "file_core": {
        "file_format": "csv",
        "file_name": "xxx_base_distribution_by_cycle_metrics.csv"
      },
      "schema_type": "file"
    },
    {
      "describedBy": "http://schema.integration.data.humancellatlas.org/type/file/5.3.4/analysis_file",
      "file_core": {
        "file_format": "csv",
        "file_name": "xxx_error_summary_metrics.csv"
      },
      "schema_type": "file"
    },
    {
      "describedBy": "http://schema.integration.data.humancellatlas.org/type/file/5.3.4/analysis_file",
      "file_core": {
        "file_format": "csv",
        "file_name": "xxx_gc_bias.csv"
      },
      "schema_type": "file"
    },
    {
      "describedBy": "http://schema.integration.data.humancellatlas.org/type/file/5.3.4/analysis_file",
      "file_core": {
        "file_format": "csv",
        "file_name": "xxx_pre_adapter_detail_metrics.csv"
      },
      "schema_type": "file"
    },
    {
      "describedBy": "http://schema.integration.data.humancellatlas.org/type/file/5.3.4/analysis_file",
      "file_core": {
        "file_format": "csv",
        "file_name": "xxx_pre_adapter_summary_metrics.csv"
      },
      "schema_type": "file"
    },
    {
      "describedBy": "http://schema.integration.data.humancellatlas.org/type/file/5.3.4/analysis_file",
      "file_core": {
        "file_format": "bam",
        "file_name": "xxx_qc.bam"
      },
      "schema_type": "file"
    },
    {
      "describedBy": "http://schema.integration.data.humancellatlas.org/type/file/5.3.4/analysis_file",
      "file_core": {
        "file_format": "bai",
        "file_name": "xxx_qc.bam.bai"
      },
      "schema_type": "file"
    },
    {
      "describedBy": "http://schema.integration.data.humancellatlas.org/type/file/5.3.4/analysis_file",
      "file_core": {
        "file_format": "bam",
        "file_name": "xxx_rsem.bam"
      },
      "schema_type": "file"
    },
    {
      "describedBy": "http://schema.integration.data.humancellatlas.org/type/file/5.3.4/analysis_file",
      "file_core": {
        "file_format": "results",
        "file_name": "xxx_rsem.genes.results"
      },
      "schema_type": "file"
    },
    {
      "describedBy": "http://schema.integration.data.humancellatlas.org/type/file/5.3.4/analysis_file",
      "file_core": {
        "file_format": "results",
        "file_name": "xxx_rsem.isoforms.results"
      },
      "schema_type": "file"
    },
    {
      "describedBy": "http://schema.integration.data.humancellatlas.org/type/file/5.3.4/analysis_file",
      "file_core": {
        "file_format": "matrix",
        "file_name": "xxx.zarr!.zattrs"
      },
      "schema_type": "file"
    },
    {
      "describedBy": "http://schema.integration.data.humancellatlas.org/type/file/5.3.4/analysis_file",
      "file_core": {
        "file_format": "unknown",
        "file_name": "xxx.zarr!.zgroup"
      },
      "schema_type": "file"
    },
    {
      "describedBy": "http://schema.integration.data.humancellatlas.org/type/file/5.3.4/analysis_file",
      "file_core": {
        "file_format": "unknown",
        "file_name": "xxx.zarr!expression_matrix!.zgroup"
      },
      "schema_type": "file"
    },
    {
      "describedBy": "http://schema.integration.data.humancellatlas.org/type/file/5.3.4/analysis_file",
      "file_core": {
        "file_format": "unknown",
        "file_name": "xxx.zarr!expression_matrix!cell_id!.zarray"
      },
      "schema_type": "file"
    },
    {
      "describedBy": "http://schema.integration.data.humancellatlas.org/type/file/5.3.4/analysis_file",
      "file_core": {
        "file_format": "unknown",
        "file_name": "xxx.zarr!expression_matrix!cell_id!0.0"
      },
      "schema_type": "file"
    },
    {
      "describedBy": "http://schema.integration.data.humancellatlas.org/type/file/5.3.4/analysis_file",
      "file_core": {
        "file_format": "unknown",
        "file_name": "xxx.zarr!expression_matrix!expression!.zarray"
      },
      "schema_type": "file"
    },
    {
      "describedBy": "http://schema.integration.data.humancellatlas.org/type/file/5.3.4/analysis_file",
      "file_core": {
        "file_format": "unknown",
        "file_name": "xxx.zarr!expression_matrix!expression!0.0"
      },
      "schema_type": "file"
    },
    {
      "describedBy": "http://schema.integration.data.humancellatlas.org/type/file/5.3.4/analysis_file",
      "file_core": {
        "file_format": "unknown",
        "file_name": "xxx.zarr!expression_matrix!gene_id!.zarray"
      },
      "schema_type": "file"
    },
    {
      "describedBy": "http://schema.integration.data.humancellatlas.org/type/file/5.3.4/analysis_file",
      "file_core": {
        "file_format": "unknown",
        "file_name": "xxx.zarr!expression_matrix!gene_id!0.0"
      },
      "schema_type": "file"
    },
    {
      "describedBy": "http://schema.integration.data.humancellatlas.org/type/file/5.3.4/analysis_file",
      "file_core": {
        "file_format": "unknown",
        "file_name": "xxx.zarr!expression_matrix!qc_metric!.zarray"
      },
      "schema_type": "file"
    },
    {
      "describedBy": "http://schema.integration.data.humancellatlas.org/type/file/5.3.4/analysis_file",
      "file_core": {
        "file_format": "unknown",
        "file_name": "xxx.zarr!expression_matrix!qc_metric!0.0"
      },
      "schema_type": "file"
    },
    {
      "describedBy": "http://schema.integration.data.humancellatlas.org/type/file/5.3.4/analysis_file",
      "file_core": {
        "file_format": "unknown",
        "file_name": "xxx.zarr!expression_matrix!qc_values!.zarray"
      },
      "schema_type": "file"
    },
    {
      "describedBy": "http://schema.integration.data.humancellatlas.org/type/file/5.3.4/analysis_file",
      "file_core": {
        "file_format": "unknown",
        "file_name": "xxx.zarr!expression_matrix!qc_values!0.0"
      },
      "schema_type": "file"
    }
  ],
  "process_core": {
    "process_id": "xxx"
  },
  "process_type": {
    "text": "analysis"
  },
  "reference_bundle": "xxx",
  "schema_type": "process",
  "tasks": [
    {}
  ],
  "timestamp_start_utc": "xxx",
  "timestamp_stop_utc": "xxx"
}

You can see that all of the "zarr-family" files are submitted under the analysis file schema; the main one, xxx.zarr!.zattrs, is marked as "file_format": "matrix" and the other auxiliary files are marked as "file_format": "unknown". So yes, they will be submitted using the analysis_file.json schema.

  1. If we are to add this field to the analysis_file.json, how do you all want to handle current analysis files that get submitted in the secondary bundle?

I guess we can either treat all analysis files that don't have the is_auxiliary flag as main/standalone files (so the flag defaults to False), e.g. bam, csv, or we can explicitly set is_auxiliary to False for standalone files and to True for auxiliary files (xxx.zarr!expression_matrix!qc_values!.zarray, xxx.zarr!expression_matrix!qc_metric!0.0, ...). These two approaches look very similar to us, since either way we have to set the flag for each file when creating the analysis_process, filtering by format as below:

# Pseudocode: the FIGURE_OUT_* names are placeholders for pipeline-specific helpers.
[{
    'describedBy': 'http://schema.integration.data.humancellatlas.org/type/file/5.3.4/analysis_file',
    'schema_type': 'file',
    'file_core': {
        'file_name': FIGURE_OUT_NAME(analysis_file),
        'file_format': FIGURE_OUT_FORMAT(analysis_file),
        'is_auxiliary': FIGURE_OUT_AUXILIARY_BY_FORMAT(analysis_file)
    }
} for analysis_file in analysis_files]
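
For illustration, the placeholders above could be filled in roughly as follows. This is only a sketch, under the assumption that zarr-family files are recognised by the `.zarr!` marker in their name and that the `.zattrs` file is the main one, as in the example output list. The helper names are hypothetical and `is_auxiliary` is still only a proposed field.

```python
import os

SCHEMA_URL = ("http://schema.integration.data.humancellatlas.org"
              "/type/file/5.3.4/analysis_file")
ZARR_MARKER = ".zarr!"  # assumption: every zarr-family file name contains this


def figure_out_format(file_name):
    """Mirror the formats in the example: main zarr file is 'matrix', the rest 'unknown'."""
    if ZARR_MARKER in file_name:
        return "matrix" if file_name.endswith("!.zattrs") else "unknown"
    # Otherwise fall back to the last extension, e.g. 'xxx_qc.bam.bai' -> 'bai'.
    return os.path.splitext(file_name)[1].lstrip(".") or "unknown"


def figure_out_auxiliary(file_name):
    """A zarr-family file is auxiliary unless it is the main .zattrs file."""
    return ZARR_MARKER in file_name and not file_name.endswith("!.zattrs")


def describe_outputs(file_names):
    """Build the 'outputs' entries for an analysis_process document."""
    return [{
        "describedBy": SCHEMA_URL,
        "schema_type": "file",
        "file_core": {
            "file_name": name,
            "file_format": figure_out_format(name),
            "is_auxiliary": figure_out_auxiliary(name),
        },
    } for name in file_names]


outputs = describe_outputs([
    "xxx_qc.bam",
    "xxx_qc.bam.bai",
    "xxx.zarr!.zattrs",
    "xxx.zarr!expression_matrix!cell_id!0.0",
])
```

With the first approach (a missing flag means False), the `is_auxiliary` key could simply be omitted for standalone files instead of being set explicitly.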

hannes-ucsc commented 6 years ago

so I can understand the usage better: is the intention that each analysis bundle submitted by an analysis pipeline to ingest will contain, in addition to the regular analysis files, a collection of one main zarray file and other files, and that it is these files (the main zarray file and the others) that will need to be marked as auxiliary or not?

What @rexwangcc said.

Will the "main zarray file and other files" be submitted using the analysis_file.json schema? If so, I propose putting the proposed is_auxiliary field in the analysis_file.json schema. Unless there is a reason that the field might ever be needed for other types of files like sequence data files, image files, etc.

I think this might come in handy for imaging files, too.

At the very highest level, what is missing here is a way to express relationships between files. Orange and potentially others are in need of a mechanism to answer the following question: In order to process file A which other auxiliary files do I also need? Without this mechanism, only a heuristic can be used. In the example @rexwangcc gave, the heuristic would be to consider all files with the same prefix (xxx in that example). @tburdett mentioned that the relationship will be made evident by links.json which links those files to the same process. But this doesn't express the fact that I don't need the .bam when I'm interested only in the matrix files. Neither does that cover the case where there are two independent zarray stores in a single analysis bundle.
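To make the heuristic concrete, here is a small sketch of one variant of it. It groups files on the text before the first `!` in the file name, which is an assumption about how the pipeline names zarr-family files, not something the metadata guarantees. `group_by_prefix` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_prefix(file_names, separator="!"):
    """Group file names by the text before the first separator.

    Names without the separator (e.g. a .bam) are left out, since the
    heuristic has nothing to say about them.
    """
    groups = defaultdict(list)
    for name in file_names:
        prefix, found, _rest = name.partition(separator)
        if found:
            groups[prefix].append(name)
    return dict(groups)


stores = group_by_prefix([
    "a.zarr!.zattrs",
    "a.zarr!expression_matrix!expression!0.0",
    "b.zarr!.zattrs",
    "xxx_qc.bam",
])
print(stores)
```

This separates two stores only if their names happen to differ, and it says nothing about which file in a group is the main one, which is why it remains a heuristic.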

Maybe we should simply add more edges to the graph via links.json. Edges that express relationships directly between files. As in "file A is useless without file B."
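A sketch of what such direct file-to-file edges would allow. The edge representation here, a list of `(file, file_it_needs)` pairs, is purely hypothetical and is not the actual links.json format; `required_files` is likewise an illustrative helper.

```python
def required_files(target, edges):
    """Return every file that `target` needs, following edges transitively.

    `edges` is a list of (file, file_it_needs) pairs.
    """
    needed = set()
    stack = [target]
    while stack:
        current = stack.pop()
        for dependent, dependency in edges:
            if dependent == current and dependency != target and dependency not in needed:
                needed.add(dependency)
                stack.append(dependency)
    return needed


# Hypothetical edges: "the first file is useless without the second".
edges = [
    ("xxx.zarr!.zattrs", "xxx.zarr!.zgroup"),
    ("xxx.zarr!.zattrs", "xxx.zarr!expression_matrix!expression!.zarray"),
    ("xxx.zarr!expression_matrix!expression!.zarray",
     "xxx.zarr!expression_matrix!expression!0.0"),
    ("xxx_qc.bam.bai", "xxx_qc.bam"),
]

print(sorted(required_files("xxx.zarr!.zattrs", edges)))
```

Asking for the matrix file then pulls in only the zarr files it depends on and not the .bam, and two independent stores in one bundle stay separate because their edges never connect.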

hewgreen commented 6 years ago

the case where there are two independent zarray stores in a single analysis bundle

In the imaging case, spacetx own and control the metadata and linking of files. For spacetx format imaging data we've been told that pointing to their experiment.json (their highest-level entity) is all the HCA metadata needs to do (we may mark the rest as auxiliary). But as Hannes points out, this would mean that in a bundle with more than one experiment.json file you wouldn't know which aux files belonged to which experiment.json without opening each experiment.json to find out. DCP couldn't do this very reliably because we don't control the format. The knock-on effect would be that a user couldn't download just one experiment. Further down the hierarchy they also have multiple FOV.json files, which are manifests with pointers. At that granularity it is likely there will be more than one per bundle, which gives us the same problem Hannes mentions.

For matrix files we have more control of the format, so we could inspect the .zarr!.zattrs file, but it would be nice to have some extra links so we didn't have to. Depending on consumer requirements we may need to do something similar for imaging, with the added worry that the format is third-party and not stable. So a general mechanism would be great.

(I accept it would be fair to ignore the imaging problem for now)

malloryfreeberg commented 5 years ago

Discussed on Metadata call 19/11/2018. Decided to make new ticket with specific use cases (e.g. zarr*, bam/bai). Discuss with pipelines team. @hannes-ucsc

hannes-ucsc commented 5 years ago

Notes to self: BAI/BAM and imaging.

hannes-ucsc commented 5 years ago

Superseded by #623.