When a new dataset needs staging, the process begins by examining the XML file pointed to by the path property of the JSON representation. This XML may contain all of the information needed to stage the dataset, unless it references other XML files. When a dataset XML doesn't reference any other dataset XMLs, then
When the primary XML cites another XML as an external resource, that XML should be examined as well. Such XMLs may reference files that weren't identified by the primary XML, but it is also possible that a file referenced by an "additional" XML is the same as one referenced in the primary XML. In either case, duplicate files must not be staged twice. A possible exception is duplicates that end up in different folders, which would avoid a filename conflict error.
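The traversal described above can be sketched as follows. This is a minimal sketch, not SMRT Link's own logic. It assumes that every referenced file appears as a `ResourceId` attribute holding a plain filesystem path, and that a reference ending in `.xml` is another dataset XML to follow. The function name `collect_files` is hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def collect_files(primary_xml):
    """Return the files referenced by a dataset XML and any XMLs it cites."""
    to_visit = [Path(primary_xml).resolve()]
    seen_xmls = set()
    files = set()

    while to_visit:
        xml_path = to_visit.pop()
        if xml_path in seen_xmls:
            continue  # guards against XMLs that reference each other
        seen_xmls.add(xml_path)

        for element in ET.parse(str(xml_path)).iter():
            # Assumption: references are stored in a "ResourceId" attribute.
            resource_id = element.get("ResourceId")
            if not resource_id:
                continue
            # Relative references are resolved against the citing XML's folder.
            target = (xml_path.parent / resource_id).resolve()
            if target.suffix == ".xml":
                to_visit.append(target)  # followed, not listed as a data file
            else:
                files.add(target)  # a set drops repeated references

    return sorted(files)
```

Because deduplication is by resolved path, a file cited by both the primary and an "additional" XML is returned once, while two files that share a name but live in different folders are both kept.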
An alternative method of identifying dataset files is to use the pbmeta:CollectionPathUri, which identifies the directory where the collection's files live. This directory should contain the files uploaded from an instrument ("collection" is used here as a term for a set of sequencing wells from a single SMRT cell).
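A rough sketch of the directory-based alternative is shown here. It assumes that `CollectionPathUri` appears as an XML element whose text is a plain filesystem path (not a `file://` URI), and it matches on the local tag name so the `pbmeta` namespace URI does not need to be known. The function name `collection_files` is hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def collection_files(dataset_xml):
    """List the files in the directory named by CollectionPathUri, if any."""
    for element in ET.parse(str(dataset_xml)).iter():
        # Compare local names so the pbmeta namespace URI need not be known.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "CollectionPathUri" and element.text:
            # Assumption: the element text is a plain directory path.
            directory = Path(element.text.strip())
            return sorted(p for p in directory.iterdir() if p.is_file())
    return []  # no CollectionPathUri found in this XML
```

This lists only the top level of the directory. Whether subdirectories also need to be staged is not something the sketch tries to answer.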
Background on dataset representation
XML and JSON representations
SMRT Link's primary API for defining a "dataset" is the dataset XML. SMRT Link imports a small subset of the data from the XML into its database, which is available in a JSON representation via HTTP. The data found in the JSON representation is very limited, but it does include a path to the XML file.
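As a small illustration, the XML location can be pulled out of the JSON representation like this. The sketch assumes the JSON text has already been retrieved from SMRT Link (no endpoint is assumed here) and relies only on the `path` property mentioned above. The function name `dataset_xml_path` and the sample values are hypothetical.

```python
import json
from pathlib import Path


def dataset_xml_path(dataset_json):
    """Return the dataset XML path stored in the JSON representation."""
    record = json.loads(dataset_json)
    return Path(record["path"])  # "path" points at the dataset XML file


# Hypothetical JSON text; the real representation has more properties.
example = '{"name": "example dataset", "path": "/data/example/dataset.xml"}'
xml_path = dataset_xml_path(example)
```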
XML contents
A dataset XML file consists of three major sections:
External Resources
The external resources section will always contain the main BAM file. However, it will often contain multiple BAM files, as in the case of an "all samples" dataset resulting from a barcoded run.
Supplemental Resources
The supplemental resources section is optional, but it can contain all sorts of files associated with a dataset. In the case of the files uploaded by the Revio after a run, many, but not all, are referenced in the supplemental resources section of the XML(s).
Dataset Metadata
The dataset metadata section does not usually reference any files; rather, it contains metadata about the run that generated the reads. The JSON representation stored in SMRT Link's database includes some of the more important datapoints, such as barcode name, the number of records (e.g. reads), and more.