hannes-ucsc closed 10 months ago
Would be good to have the snapshot names in the above table.
@theathorn to follow up with Nate to see if the above numbers are exactly what he would expect - unless he's absolutely certain, Azul team needs to do more investigation.
Added snapshot names
The number of orphaned files here is not surprising to me (in the sense that I would expect there to be files included in the snapshot that are connected only to the file entity and not connected to any of the other entities). I'm surprised that they wouldn't be included in the index, however, and I assume this is because the entry point for the index is the biosample entity. Is that correct?
In general, the way AnVIL datasets are structured today is that they contain a set of data file objects and a set of tabular data. There is no requirement, however, that the data file objects are always referenced in the tabular data. This means in the TDR snapshots we create, there can be files in the file entity that have no associated activity, biosample, or donor entry, and I expect we would still want these to be discoverable by users.
I assume this is because the entry point for the index is the biosample entity. Is that correct?
That is correct.
We discussed this during stand-up today. We agreed that
Reposting from Slack #ucsc-anvil-explorer-collab :
Following up on yesterday's discussion regarding only a small fraction (35% overall) of the files that are in released AnVIL data workspaces being indexed and populated into user workspaces ...
A primary objective of the AnVIL program is to enable/support researchers successfully completing their analysis within AnVIL in a timely manner. We must remove significant impediments to this wherever possible. Ensuring that researchers can easily and reliably get all the files in the released AnVIL workspace into their personal analysis workspace is critical.
At a minimum, I think we must ensure that when a researcher selects only the Dataset facet, all of the files from the released AnVIL data workspaces are made available in the researcher's workspace. (I realize that when a researcher starts to subselect within a dataset, the file handoff may become lossy until all the correct file references are in place - which will require an extensive period of time).
If we don't do at least that much, then I think we may see AnVIL researchers continuing to clone the AnVIL data workspaces and use those for their analysis. This is a behavior we want to move away from and does not work at all for inter-system interoperability.
I also think that where we can easily and reliably add missing references, such as associating bam/cram index files, we should. This could be done either in the AnVIL data workspaces prior to TDR ingestion (preferable?), or during TDR ingestion or mapping.
I understand and support the objective of getting the AnVIL data into a highly cohesive state with all the correct references in place. The AnVIL data is far from that today, and getting there will require motivation, policy, and mechanisms for the submitters/curators. We must not, however, penalize/impede researchers using the system today by limiting their file access while we work towards this desired state.
Regarding the sponsor demo, it would help build confidence if all the files were populated into a user's analysis workspace. I don't think this is a strict requirement for the first demo, although I do believe we at least need a clear plan that we present when asked.
This was discussed again on 1/10/23. After the Broad addressed the low-hanging fruit, like linking bams to fastqs, there are still some orphans left. The Broad agreed to provide concrete examples of the remaining orphaned files. UCSC expressed their preference to have these files explicitly described, akin to supplementary_files in HCA.
Our preference is based on
1) the ease-of-use principle (snapshot readers, be that UCSC or anyone else, should not have to write elaborate queries to discover orphaned files),
2) performance (queries for orphans are expensive: they need to examine all rows in all tables that have foreign keys to the files table) and
3) implementation effort (it would be significant work for us to implement the discovery of orphaned files).
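To illustrate the performance point, here is a minimal sketch of what orphan discovery could look like. It assumes a simplified, hypothetical relational layout: the tables `file`, `biosample_file` and `activity_file` and the helper `find_orphans` are invented for illustration and are not the actual AnVIL snapshot schema. The point is only that every table that can reference a file has to be consulted before a file can be declared unreferenced.

```python
import sqlite3


def find_orphans(conn, link_tables):
    """Return the ids of files that none of the given link tables reference.

    All table and column names are hypothetical, chosen for illustration only.
    """
    # One NOT EXISTS clause per referencing table: adding another table
    # that points at files means adding another scan to this query.
    clauses = " AND ".join(
        f"NOT EXISTS (SELECT 1 FROM {table} AS l WHERE l.file_id = f.file_id)"
        for table in link_tables
    )
    rows = conn.execute(
        f"SELECT f.file_id FROM file AS f WHERE {clauses} ORDER BY f.file_id"
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE file (file_id TEXT PRIMARY KEY);
    CREATE TABLE biosample_file (biosample_id TEXT, file_id TEXT);
    CREATE TABLE activity_file (activity_id TEXT, file_id TEXT);
    INSERT INTO file VALUES ('f1'), ('f2'), ('f3');
    INSERT INTO biosample_file VALUES ('b1', 'f1');
    INSERT INTO activity_file VALUES ('a1', 'f2');
""")

orphans = find_orphans(conn, ["biosample_file", "activity_file"])
print(orphans)
```

In this toy data only `f3` has no referencing row. A real snapshot would have more referencing tables and far more rows, which is where the cost comes from.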
Even if we go with the supplementary file solution, UCSC would still have to shoulder some of the implementation burden since supplementary files won't naturally align to subgraphs, at least not the way we currently define them (biosamples). I think this solution is a fair distribution of labor and it is good for other consumers of AnVIL snapshots.
This was discussed again on 2/22/23. No examples of additional orphaned files have been provided to date. We seemed to coalesce around marking files with a flag in an additional column in the anvil_file table. @ncalvanese1 agreed to explore this option, involving @hannes-ucsc if/when needed.
@hannes-ucsc -- A couple of questions/thoughts on this one:
From a practical standpoint, we should consider an "orphaned file" to be one that can't be connected back to an anvil_biosample record, correct?
I think so, but good question. If we have many files that can be connected to some other biomaterial or organism, like donor or library, we'd be open to considering changing the subgraph definition to center around that type of entity. Currently there is a subgraph per biosample.
we do have some cases where an anvil_file record could connect to another anvil_file record via anvil_activity, but never actually join back to anvil_biosample
I would consider those supplementary as well, but maybe transitively. Let's say file B is derived from file A using an activity, but file A is not derived from anything. In that case you could mark files A and B as supplementary. If it's easier not to mark B that would be OK too, and we can add a query to look for files derived from A, and files derived from files derived from A and so on. It is relatively easy for us to look for things that are referenced, but prohibitively expensive to look for things that aren't.
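The "files derived from A, and files derived from files derived from A" lookup amounts to a forward traversal over activities. A minimal sketch, assuming each activity can be reduced to a pair of input and output file ids (the function name `derived_closure` and this in-memory representation are hypothetical, not how Azul or the snapshots actually model it):

```python
from collections import deque


def derived_closure(roots, activities):
    """Return the root files plus every file transitively derived from them.

    `activities` is an iterable of (input_file_ids, output_file_ids) pairs.
    This representation is an assumption made for the sketch.
    """
    # Index each input file to the files produced from it.
    produced = {}
    for inputs, outputs in activities:
        for source in inputs:
            produced.setdefault(source, set()).update(outputs)

    # Breadth-first walk that only follows references forward,
    # so nothing unreferenced ever has to be searched for.
    seen = set(roots)
    queue = deque(roots)
    while queue:
        current = queue.popleft()
        for child in produced.get(current, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# File B is derived from A, C from B; X and Y are unrelated.
activities = [({"A"}, {"B"}), ({"B"}, {"C"}), ({"X"}, {"Y"})]
supplementary = derived_closure({"A"}, activities)
print(sorted(supplementary))
```

With only A marked, the walk also picks up B and C, which is why marking B explicitly would be optional.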
If that works for you, I'll get a PR up in the anvil_tdr_ingest repo.
It does. Thank you.
Schema PR is here: https://github.com/broadinstitute/anvil_tdr_ingest/pull/3
Schema PR was merged and updated snapshots were released. The work to switch to those snapshots is tracked in #5014. Adding support for the new is_supplementary column is tracked in #5000.
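For comparison with an orphan search, a flag column reduces discovery to a single filter on one table. A minimal sketch, assuming a toy SQLite copy of `anvil_file` in which `is_supplementary` is stored as 0/1 (the `file_id` column, the storage type and the helper name are assumptions, not taken from the schema PR):

```python
import sqlite3


def supplementary_file_ids(conn):
    """Return the ids of files flagged as supplementary.

    Assumes `anvil_file` has a `file_id` column and a 0/1
    `is_supplementary` column; both details are illustrative.
    """
    rows = conn.execute(
        "SELECT file_id FROM anvil_file "
        "WHERE is_supplementary = 1 ORDER BY file_id"
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE anvil_file (
        file_id TEXT PRIMARY KEY,
        is_supplementary INTEGER NOT NULL DEFAULT 0
    );
    INSERT INTO anvil_file VALUES ('f1', 0), ('f2', 1), ('f3', 1);
""")

flagged = supplementary_file_ids(conn)
print(flagged)
```

No other table needs to be examined, which is the ease-of-use and performance argument made earlier in the thread.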
The supplementary file schema change and our support for it are fully implemented. The 1000G bug mentioned in the previous comment was addressed and our workaround for it was removed.
Originally posted by @noah-aviel-dove in https://github.com/DataBiosphere/azul/issues/4613#issuecomment-1290108230