HumanCellAtlas / metadata-schema

This repo is for the metadata schemas associated with the HCA
Apache License 2.0

Update imaging_targets.json to include a field for SAM file in place of probe seq #587

Open dosumis opened 5 years ago

dosumis commented 5 years ago

For which schema is a change/update being suggested?

I would like to request an update to the imaging_targets.json schema.

What should the change/update be?

The current schema has a field for probe sequence:

"probe_sequence": {
            "pattern": "[ATGCUatgcu]+",
            "description": "Sequence of a probe used to detect target.",
            "type": "string",
            "user_friendly": "Probe sequence"
        },

We suggest replacing this with a new field:

"probe_sequence_sam_file": {
            "description": "The name of a SAM file (https://en.wikipedia.org/wiki/SAM_(file_format)) that describes the probe.  While this field is not compulsory, its use is strongly encouraged for all cases where the relevant information is available.",
            "type": "string",
            "user_friendly": "Probe sequence (SAM) file."
        },

This will NOT be a required field.

Why is the change requested?

Probe sequence is not something that needs to be indexed. Rather, it is information that may be used in re-analysis. As such it probably doesn't belong in the metadata schema, but instead should live in a standard exchange format for describing probes. We think that the SAM file format (https://en.wikipedia.org/wiki/SAM_(file_format)) is a good choice for this, although we would welcome feedback and suggestions for alternatives. We propose that there should be one SAM file per probe, allowing a link to the relevant SAM file to be kept in targets.json, connected with information about the target (e.g. target gene name & accession). The same standard can be used if we have to ingest experiments with specified probe sets that do not produce imaging data.

To this end we propose that probe sequences and related information should be stored in files in [ and the probe_sequence field be replaced by a new field for storing the filename of a SAM file describing the probe.

The new field will not be required as:

(a) Making this a required field would need some advanced JSON schema spec (probably with further nesting) to deal with the fact that this field is only relevant where the targeting reagent is a sequence probe.

(b) There are circumstances where it will not be possible to provide this (e.g. commercially ordered probes for some specified gene target with an obfuscated probe sequence). We should however strongly encourage that this information be provided if available.
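To illustrate point (a), here is a minimal sketch of the conditional check that a "required only for sequence probes" rule would imply. It is written as plain Python rather than JSON schema, and the names `reagent_type`, `sequence_probe` and `target_name` are hypothetical, chosen for illustration only; they are not fields of the current imaging_targets.json schema.

```python
def missing_probe_file(target: dict) -> bool:
    """Return True when a sequence-probe target has no probe file name."""
    # "reagent_type" and the value "sequence_probe" are hypothetical names,
    # not part of the actual imaging_targets.json schema.
    is_sequence_probe = target.get("reagent_type") == "sequence_probe"
    has_file = bool(target.get("probe_sequence_sam_file"))
    return is_sequence_probe and not has_file


# Illustrative targets; "target_name" is also a hypothetical field name.
targets = [
    {"target_name": "geneA", "reagent_type": "sequence_probe",
     "probe_sequence_sam_file": "geneA_probes.sam"},
    {"target_name": "geneB", "reagent_type": "sequence_probe"},
    {"target_name": "proteinC", "reagent_type": "antibody"},
]

# Only sequence-probe targets without a file are flagged.
flagged = [t["target_name"] for t in targets if missing_probe_file(t)]
print(flagged)
```

Because the field stays optional in the proposal, a check like this could only ever produce a warning for the submitter, not a validation failure.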

CC @lauraclarke @zperova @ambrosejcarr @hewgreen

dosumis commented 5 years ago

Current status: Awaiting feedback before making pull request.

lauraclarke commented 5 years ago

@dosumis Can you write up the use case for this file and I will send it to the GA4GH large-scale genomics file formats working group to see if they have a view. I am wondering if an indexed fasta file would be simpler for people to generate than a SAM file but it would be good to see what the GA4GH file formats group think.

lauraclarke commented 5 years ago

The other people to ask would be ArrayExpress as I assume these are somewhat equivalent to microarray probes and see if there is already any sort of standard in that space

dosumis commented 5 years ago

Why SAM files?

PRO

  1. They include a range of fields that could be useful in re-mapping
  2. They are widely used: "The format is used to hold mapped data within the Genome Analysis Toolkit (GATK) and across the Broad Institute, the Wellcome Sanger Institute, and throughout the 1000 Genomes Project." [WP]
  3. Standard tooling is available and appears to be widely used http://biobits.org/samtools_primer.html

CON

Alignment sections have 11 mandatory fields. This may be asking a bit much.
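That said, for a probe that has not been aligned, most of the 11 fields can be filled with the "unavailable" placeholder values that the SAM specification defines. A rough sketch follows; the function name, probe name and sequence are made up for illustration.

```python
def unaligned_sam_record(probe_name: str, sequence: str) -> str:
    """Build one tab-separated SAM record for an unaligned probe."""
    fields = [
        probe_name,  # QNAME: name of the probe
        "4",         # FLAG: 0x4 means the segment is unmapped
        "*",         # RNAME: no reference sequence
        "0",         # POS: no position
        "0",         # MAPQ: no mapping quality
        "*",         # CIGAR: no alignment
        "*",         # RNEXT: no mate
        "0",         # PNEXT: no mate position
        "0",         # TLEN: no template length
        sequence,    # SEQ: the probe sequence itself
        "*",         # QUAL: base qualities not stored
    ]
    return "\t".join(fields)


# Hypothetical probe name and sequence, for illustration only.
record = unaligned_sam_record("geneA_probe_1", "ATGCATGCATGC")
print(record)
```

Only the name and the sequence carry information here, which is essentially what a FASTA record would hold.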

Request for further input

This is definitely outside of my area of expertise. Feedback needed.

@malloryfreeberg - anything to add?

The other people to ask would be ArrayExpress as I assume these are somewhat equivalent to microarray probes and see if there is already any sort of standard in that space

Good idea. I'll ping them

lauraclarke commented 5 years ago

@dosumis I know what a SAM file is and how they work. I am trying to figure out what use case you are aiming to meet when providing this information?

Is this to provide genomic locations for the aligned probes? Or is this about providing the probe sequences, where the location isn't important?

If the first one is your aim then SAM (well, BAM) is the best format. If the second is your aim, i.e. the sequences and identity of the probes, then FASTA might be better as the effort involved is a lot lower.

It is worth noting that SAM/BAM can also hold unaligned probes, so if both are required then SAM/BAM is likely to be better.

Another important question: are there any existing tools which expect this information in a particular format? The format they use will also need to be considered and, if it isn't SAM/BAM, how easy it is to convert between SAM/BAM and that format matters.

dosumis commented 5 years ago

The main use case I have in mind is in (re)-mapping to transcripts based on some new release of genome annotation or using some reference transcriptome.

I'll leave it to others to say whether FASTA is always sufficient for this or if SAM would be better.

(I've asked Irene's group for comments).

hewgreen commented 5 years ago

I second this suggestion but just want to add that this field should call a supplementary file which can be multiple formats. This gives us some flexibility. For example, we could suggest a preferred format in the description. I'm not sure if we can currently validate the specific format of supplementary files.

dosumis commented 5 years ago

I second this suggestion but just want to add that this field should call a supplementary file which can be multiple formats

Wouldn't it be better to insist on a uniform standard? Leaving it completely open sounds like a nightmare for downstream users. How can you build an analysis pipeline if you have no idea what format files you have to work with? If SAM is too high a barrier, I guess having two options (FASTA or SAM) might be OK as long as it is easy for consumers of data & metadata to tell which is which programmatically, but I'd like to hear from some bio-informaticians with experience in this.
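As a rough idea of what "telling which is which" could look like if only FASTA and plain-text SAM were allowed, here is a small heuristic sketch. It relies on two facts about the formats: SAM header lines start with "@" and SAM records have at least 11 tab-separated fields, while FASTA records start with ">". The function name is made up, and this is not a validator.

```python
def sniff_probe_format(text: str) -> str:
    """Guess 'fasta', 'sam' or 'unknown' from the first non-blank line."""
    for line in text.splitlines():
        if not line.strip():
            continue  # skip leading blank lines
        # SAM: header lines start with "@"; records have >= 11 tab-separated fields.
        if line.startswith("@") or len(line.split("\t")) >= 11:
            return "sam"
        # FASTA: every record starts with a ">" description line.
        if line.startswith(">"):
            return "fasta"
        return "unknown"
    return "unknown"


fasta_text = ">geneA_probe_1\nATGCATGCATGC\n"
sam_text = "@HD\tVN:1.6\ngeneA_probe_1\t4\t*\t0\t0\t*\t*\t0\t0\tATGCATGCATGC\t*\n"
print(sniff_probe_format(fasta_text), sniff_probe_format(sam_text))
```

Recording the format explicitly in the metadata would be more robust than sniffing file contents, and binary BAM files would need a separate check.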

lauraclarke commented 5 years ago

I agree with @dosumis If this file has a specific use case and works with specific downstream tools it should be in a standard format which works in that context.

I agree that we shouldn't limit ourselves to a rigid set of file types for all supplementary files but a specific file type meeting a specific need should be in a specific format

hewgreen commented 5 years ago

Sure it would be great but we need the validators to accomplish this. Taking this information out of the schema and putting it into a supplementary file has wider validation implications anyway especially if we need to maintain any mapping to the codebook or HCA metadata. So insisting on a standard format is probably a relatively minor concern. @dosumis You have a better idea than me about the automation and throughput required for this remapping.

dosumis commented 5 years ago

Sure it would be great but we need the validators to accomplish this.

Doesn't sound like an insurmountable barrier to me given that there are straightforward, open source options for this. And I don't see any great harm in ingesting initial SpaceTx datasets with this spec before we have validation built into the DCP.

especially if we need to maintain any mapping to the codebook or HCA metadata

The suggested spec has one file per probe, with the filename being directly attached to the target in targets.json, so there are no additional mapping issues. Mapping to the codebook would work as now - using the target name as a key.

lauraclarke commented 5 years ago

I think it is important to distinguish between arbitrary supplementary files, which are being collected because they might be useful downstream although it is unclear what that use case is, and specific files, which are being collected because there is a specific requirement.

In this case, I think we are in the latter category. We should review and decide how long it will take to figure out. It might be that we declare in the first instance that the format should be BAM (I would recommend that over plain SAM files) but don't validate it initially.

What I am still missing is a description of the use cases and any requirements from the tools which we expect to use these files and these need to be clear before we make a firm decision.

If collecting enough information to make this decision correctly is going to take too long then using our current solution may be the best option.

zperova commented 5 years ago

There is a discussion on SAM file and info it should contain in the wg3-probes channel on SpaceTx slack. In short:

lauraclarke commented 5 years ago

It would be great to hear from both @ambrosejcarr and @joshmoore on this to see if there are any angles we have missed.

From a sequencing experiment perspective, I am drawing an analogy to and basing my thoughts on microarray experiments and how probe declarations and mapping works for them.

hewgreen commented 5 years ago

Some offline conversation highlights:

dosumis commented 5 years ago

@dosumis Could you clarify what you said offline. Do you propose that the field probe_sequence_sam_file at the level of target links to a SAM file with multiple probes or a specific probe in that file? If so are we saying to use primary keys to do the mapping?

The spec above links each target to a SAM file with probe(s) used as reagents to detect that target only. There can potentially be multiple probes in the SAM file. There is no key mapping involved.

My understanding of the discussion yesterday is that Ambrose proposes a single probes.sam file with key-linking from the codebook.

malloryfreeberg commented 5 years ago

@hewgreen and @zperova to ask for example probe sets to decide how best to proceed with the solution for this issue.

dosumis commented 5 years ago

SpaceTx validate the format, but not the content, of SAM and codebook files, but are looking at how they can achieve this.

But they can and will validate the integrity of the combination of the two files by checking all keys match.
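A minimal sketch of what such a key check could look like is below. It assumes that the shared key is the SAM query name (the first column of each record) and that the codebook side reduces to a list of target names. Both are assumptions for illustration, and this is not SpaceTx's actual implementation.

```python
def sam_query_names(sam_text: str) -> set:
    """Collect QNAME (first column) from every non-header SAM line."""
    names = set()
    for line in sam_text.splitlines():
        if not line or line.startswith("@"):
            continue  # skip blank lines and "@" header lines
        names.add(line.split("\t")[0])
    return names


def key_mismatches(codebook_targets, sam_text):
    """Return (targets with no probe record, probe records with no target)."""
    codebook = set(codebook_targets)
    probes = sam_query_names(sam_text)
    return codebook - probes, probes - codebook


# Illustrative data only: three codebook targets, three probe records.
codebook_targets = ["ACTB", "GAPDH", "SOX2"]
sam_text = (
    "@HD\tVN:1.6\n"
    "ACTB\t4\t*\t0\t0\t*\t*\t0\t0\tATGCATGC\t*\n"
    "GAPDH\t4\t*\t0\t0\t*\t*\t0\t0\tGGCCTTAA\t*\n"
    "MYC\t4\t*\t0\t0\t*\t*\t0\t0\tTTAACCGG\t*\n"
)
missing, extra = key_mismatches(codebook_targets, sam_text)
print(missing, extra)
```

Both returned sets being empty would mean the two files are consistent with each other.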

SAM was suggested because these files will be very small, so binarising them isn't super useful, but this is not a major concern for Ambrose either way.

I share Ambrose's preference for plain text files unless there is a pressing reason to avoid them. For a start it would make integrity checks much easier.

lauraclarke commented 5 years ago

@dosumis @ambrosejcarr The reason to use BAM rather than SAM is that no one uses plain text SAM files. They always use indexed BAM files, as they are smaller and easier to operate over, and there seems no good reason not to follow that standard here. I appreciate you can't open a BAM file directly and look at its contents, but samtools view gives you basically the same function with very little extra effort. Random access and format validation come for free and run much better on BAM files than on SAM files, so I don't see why you wouldn't use a BAM file instead.

If we are going to pick a standard bioinformatics format we should also make sure we use the standard and well-supported libraries which come with it:

http://www.htslib.org/download/

There is also a Python wrapper, https://pysam.readthedocs.io/en/latest/index.html, but that doesn't seem to have quite the same support.

If the intent is to write specialised code from the ground up to deal with these files, I would suggest designing a format that meets our precise needs rather than making compromises in order to work within an existing format.

Do we understand what programs will need to read this file and if SAM/BAM is a format they can already read?