byu-dnasc / proto-smrtlink-share


Explain how XML "Resource" elements identify dataset files #18

Closed adknaupp closed 3 months ago

adknaupp commented 3 months ago

Background on dataset representation

XML and JSON representations

SMRT Link's primary API for defining a "dataset" is the dataset XML. SMRT Link imports a small subset of the data from the XML into its database, which is available in a JSON representation via HTTP. The data found in the JSON representation is very limited, but it does include a path to the XML file.

XML contents

A dataset XML file consists of three major sections:

External Resources

The external resources section always contains the main BAM file. It often contains multiple BAM files, as in the case of an "all samples" dataset resulting from a barcoded run.

Supplemental Resources

The supplemental resources section is optional, but it can contain all sorts of files associated with a dataset. Of the files uploaded by the Revio after a run, many, but not all, are referenced in the supplemental resources section of the XML(s).

Dataset Metadata

The dataset metadata section does not usually reference any files. Instead, it contains metadata about the run that generated the reads. The JSON representation stored in SMRT Link's database includes some of the more important data points, such as the barcode name and the number of records (e.g. reads).
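To make the three sections concrete, here is a minimal sketch that groups the `ResourceId` values found in a dataset XML by the top-level section that contains them. The sample XML is hypothetical and heavily trimmed, and its element and attribute names (`ExternalResources`, `SupplementalResources`, `DataSetMetadata`, `ResourceId`) are assumptions modeled on PacBio dataset XMLs. Tags are matched by local name so the sketch does not depend on exact namespace URIs.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down dataset XML; real files carry many more attributes.
SAMPLE_XML = """<pbds:ConsensusReadSet
    xmlns:pbds="http://pacificbiosciences.com/PacBioDatasets.xsd"
    xmlns:pbbase="http://pacificbiosciences.com/PacBioBaseDataModel.xsd">
  <pbbase:ExternalResources>
    <pbbase:ExternalResource ResourceId="sample.hifi_reads.bam"/>
  </pbbase:ExternalResources>
  <pbds:SupplementalResources>
    <pbbase:ExternalResource ResourceId="sample.report.json"/>
  </pbds:SupplementalResources>
  <pbds:DataSetMetadata>
    <pbds:NumRecords>42</pbds:NumRecords>
  </pbds:DataSetMetadata>
</pbds:ConsensusReadSet>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def resources_by_section(root):
    """Map each top-level section name to the ResourceId values beneath it."""
    sections = {}
    for section in root:
        # iter() walks the whole subtree, so nested resources are included too.
        ids = [el.get("ResourceId") for el in section.iter() if el.get("ResourceId")]
        sections[local_name(section.tag)] = ids
    return sections


if __name__ == "__main__":
    print(resources_by_section(ET.fromstring(SAMPLE_XML)))
```

On this sample, the metadata section maps to an empty list, which matches the observation that it usually references no files.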

adknaupp commented 3 months ago

Dataset XML ingestion

Primary XML processing

When a new dataset needs staging, the process begins by examining the XML file pointed to by the path property of the JSON representation. This XML contains all of the information needed to stage the dataset, unless it references other XML files. When a dataset XML doesn't reference any other dataset XMLs, then

Additional XMLs and redundant files

When the primary XML cites another XML as an external resource, that XML should be examined as well. Such XMLs may reference files that weren't identified by the primary XML, but a file referenced by an "additional" XML may also be the same as one referenced in the primary XML. In either case, duplicate files must be prevented from being staged. The possible exception is duplicates that end up in different folders, since those would not cause a filename conflict error.
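One way to handle additional XMLs and redundant files is a recursive walk that records each referenced path once. The sketch below is not the project's implementation. The function name `collect_resources` is hypothetical, and it assumes that every `ResourceId` attribute is a filesystem path (absolute, or relative to the XML that cites it) rather than a URI with a scheme. It also records referenced XMLs themselves as files, which may or may not be desired when staging.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def collect_resources(primary_xml):
    """Return every path referenced by primary_xml and any nested XMLs, once each."""
    files = {}       # dict keys act as an insertion-ordered set of resolved paths
    visited = set()  # XMLs already parsed; guards against reference cycles

    def visit(xml_path):
        xml_path = Path(xml_path).resolve()
        if xml_path in visited:
            return
        visited.add(xml_path)
        for element in ET.parse(xml_path).getroot().iter():
            resource_id = element.get("ResourceId")
            if not resource_id:
                continue
            # Assumption: ResourceId is a plain path. Joining an absolute path
            # onto the parent directory simply yields the absolute path.
            target = (xml_path.parent / resource_id).resolve()
            files.setdefault(target, None)  # no-op if the path was already seen
            if target.suffix == ".xml" and target.is_file():
                visit(target)  # examine the "additional" XML as well

    visit(primary_xml)
    return list(files)
```

Deduplicating on the resolved path means the same file cited by both the primary and an additional XML is staged once, while two different files that happen to share a filename in different folders are both kept.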

Recursive resource search and resolving redundancy

adknaupp commented 3 months ago

Rationale for relying on dataset XML "Resources" to identify the files associated with a dataset

  1. Projects consist of datasets.

Alternative - pbmeta:CollectionPathUri

An alternative method of identifying dataset files is to use the pbmeta:CollectionPathUri, which identifies the directory where the collection's files live. This directory should contain the files uploaded from an instrument ("collection" is the term for a set of sequencing wells from a single SMRT Cell).
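As a rough illustration of this alternative, the sketch below pulls the collection directory out of a dataset XML and lists the files in it. It assumes that `CollectionPathUri` appears as an element whose text is a plain filesystem path. The sample XML, its nesting, the path value, and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical fragment; the nesting and the path are made up for illustration.
SAMPLE_XML = """<pbds:ConsensusReadSet
    xmlns:pbds="http://pacificbiosciences.com/PacBioDatasets.xsd"
    xmlns:pbmeta="http://pacificbiosciences.com/PacBioCollectionMetadata.xsd">
  <pbds:DataSetMetadata>
    <pbmeta:Collections>
      <pbmeta:CollectionMetadata>
        <pbmeta:CollectionPathUri>/data/example/collection</pbmeta:CollectionPathUri>
      </pbmeta:CollectionMetadata>
    </pbmeta:Collections>
  </pbds:DataSetMetadata>
</pbds:ConsensusReadSet>
"""


def collection_path_uris(root):
    """Return the text of every CollectionPathUri element, matched by local name."""
    return [
        el.text.strip()
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "CollectionPathUri" and el.text
    ]


def files_in_collection(uri):
    """List regular files directly inside the collection directory, if it exists."""
    directory = Path(uri)  # assumption: the URI is a plain path with no scheme
    if not directory.is_dir():
        return []
    return sorted(p for p in directory.iterdir() if p.is_file())


if __name__ == "__main__":
    for uri in collection_path_uris(ET.fromstring(SAMPLE_XML)):
        print(uri, files_in_collection(uri))
```

Unlike the resource-based approach, this returns whatever is in the directory, so it would also pick up files that no XML references.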