DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0
7 stars 2 forks source link

Orphaned files in AnVIL snapshots #4654

Closed hannes-ucsc closed 10 months ago

hannes-ucsc commented 2 years ago

We still aren't capturing all the files in the snapshots. The percentage of the files captured is highly variable from one snapshot to the next. Stats for anvildev following the reindex for this PR:

Snapshot ID Snapshot name Number of files in snapshot Number of files in index %
666da204-4425-4226-a720-2d2be08978f1 ANVIL_CMG_UWASH_DS_BDIS_20220922_ANV3_202210141402 20 20 100%
6e67d7dc-7864-4bbd-9c08-a87d37205ed8 ANVIL_CMG_UWASH_GRU_20220919_ANV3_202210141318 4102 1198 29%
960d68a3-c91f-4919-a04a-f16c52a62e54 ANVIL_CMG_UWASH_HMB_IRB_20220921_ANV3_202210141341 86 82 95%
9bea8775-a44e-46d0-8e4f-13519ced7d73 ANVIL_CMG_UWASH_DS_HFA_20220922_ANV3_202210141350 281 166 59%
cce13a1f-ceff-4bfd-a16b-2769ec86546d ANVIL_CMG_UWASH_GRU_IRB_20220921_ANV3_202210141354 654 650 99%
d85f61c0-5a62-4101-9292-d44f7317c45f ANVIL_CMG_UWASH_HMB_20220921_ANV3_202210141346 809 805 96%
dc40e347-5666-4cff-b7ad-3f00609a3d97 ANVIL_CMG_UWASH_DS_BAV_IRB_PUB_RD_20220921_ANV3_202210141329 358 354 99%
e22678a8-5660-4695-ad15-6254b584c17a ANVIL_CMG_UWASH_DS_EP_20220921_ANV3_202210141335 102 98 96%
f1dabaa7-a6d8-4017-9427-d7bc6321fdb1 ANVIL_GTEx_V8_hg38_20221005_ANV3_202210141405 123000 41832 34%
f462bebe-0910-4453-9b57-85fb01d267dc ANVIL_CMG_UWASH_DS_NBIA_20220922_ANV3_202210141358 171 168 98%

Originally posted by @noah-aviel-dove in https://github.com/DataBiosphere/azul/issues/4613#issuecomment-1290108230

hannes-ucsc commented 2 years ago

Would be good to have the snapshot names in the above table.

theathorn commented 2 years ago

@theathorn to follow up with Nate to see if the above numbers are exactly what he would expect - unless he's absolutely certain, Azul team needs to do more investigation.

nadove-ucsc commented 2 years ago

Added snapshot names

ncalvanese1 commented 2 years ago

The number of orphaned files here is not surprising to me (in the sense that I would expect there to be files included in the snapshot that are connected only to the file entity and not connected to any of the other entities). I'm surprised that they wouldn't be included in the index, however, and I assume this is because the entry point for the index is the biosample entity. Is that correct?

In general, the way AnVIL datasets are structured today is that they contain a set of data file objects and a set of tabular data. There is no requirement, however, that the data file objects are always referenced in the tabular data. This means in the TDR snapshots we create, there can be files in the file entity that have no associated activity, biosample, or donor entry, and I expect we would still want these to be discoverable by users.

hannes-ucsc commented 2 years ago

I assume this is because the entry point for the index is the biosample entity. Is that correct?

That is correct.

hannes-ucsc commented 2 years ago

We discussed this during stand-up today. We agreed that

mbaumann-broad commented 2 years ago

Reposting from Slack #ucsc-anvil-explorer-collab :

Following up on yesterday's discussion regarding only a small fraction (35% overall) of the files that are in released AnVIL data workspaces being indexed and populated into user workspaces ...

A primary objective of the AnVIL program is to enable/support researchers successfully completing their analysis within AnVIL in a timely manner. We must remove significant impediments to this wherever possible. Ensuring that researchers can easily and reliably get all the files in the released AnVIL workspace into their personal analysis workspace is critical.

At a minimum, I think we must ensure that when a researcher selects only the Dataset facet, all of the files from the released AnVIL data workspaces are made available in the researcher's workspace. (I realize that when a researcher starts to subselect within a dataset, the file handoff may become lossy until all the correct file references are in place - which will require an extensive period of time).

If we don't do at least that much, then I think we may see AnVIL researchers continuing to clone the AnVIL data workspaces and use those for their analysis. This is a behavior we want to move away from and does not work at all for inter-system interoperability.

I also think that where we can easily and reliably add missing references, such as associating bam/cram index files, we should. This could be done either in the AnVIL data workspaces prior to TDR ingestion (preferable?), or during TDR ingestion or mapping.

I understand and support the objective of getting the AnVIL data into a highly cohesive state with all the correct references in place. The AnVIL data is far from that today, and getting there will require motivation, policy, and mechanisms for the submitters/curators. We must not, however, penalize/impede researchers using the system today by limiting their file access while we work towards this desired state.

Regarding the sponsor demo, it would help build confidence if all the files were populated into a user's analysis workspace. I don't think this is a strict requirement for the first demo, although I do believe we at least need a clear plan that we present when asked.

hannes-ucsc commented 1 year ago

This was discussed again on 1/10/23. After the Broad addressed the low-hanging fruit, like linking bams to fastqs, there are still some orphans left. The Broad agreed to provide concrete examples of the remaining orphaned files. UCSC expressed their preference to have these files explicitly described akin to supplementary_files in HCA.

Our preference is based on

1) ease-of use principle (snapshot readers, be that UCSC or anyone else, should not have to write elaborate queries to discover orphaned files),

2) performance (queries for orphans are expensive: they need to examine all rows in all tables that have foreign keys to the files table) and

3) implementation effort (it would be significant work for us to implement the discovery of orphaned files).

Even if we go with the supplementary file solution, UCSC would still have to shoulder some of the implementation burden since supplementary files won't naturally align to subgraphs, at least not the way we currently define them (biosamples). I think this solution is a fair distribution of labor and it is good for other consumers of AnVIL snapshots.

hannes-ucsc commented 1 year ago

This was discussed again on 2/22/23. No examples of additional orphaned files were provided to date. We seemed to coalesce around marking files with a flag in an addition column in the anvil_file table. @ncalvanese1 agreed to explore this option, involving @hannes-ucsc if/when needed.

ncalvanese1 commented 1 year ago

@hannes-ucsc -- A couple of questions/thoughts on this one:

  1. From a practical standpoint, we should consider an "orphaned file" to be one that can't be connected back to a anvil_biosample record, correct? I believe that's the entry point for the indexing, but I just wanted to confirm, since we do have some cases where an anvil_file record could connect to another anvil_file record via anvil_activity, but never actually join back to anvil_biosample.
  2. I can't think of a better term to use for flagging these files other than "supplementary" or "additional" files, so I think I'm just going to stick with the "supplementary" term and create a supplementary_file_flag boolean field in the anvil_file table that would be populated with True/False depending on the status of the file. If that works for you, I'll get a PR up in the anvil_tdr_ingest repo.
hannes-ucsc commented 1 year ago

From a practical standpoint, we should consider an "orphaned file" to be one that can't be connected back to a anvil_biosample record, correct?

I think so, but good question. If we have many files that can be connected to some other biomaterial or organism, like donor or library, we'd be open to considering changing the subgraph definition to center around that type of entity. Currently there is a subgraph per biosample.

we do have some cases where an anvil_file record could connect to another anvil_file record via anvil_activity, but never actually join back to anvil_biosample

I would consider those supplementary as well, but maybe transitively. Let's say file B is derived from file A using an activity, but file A is not derived from anything. In that case you could mark files A and B as supplementary. If it's easier not to mark B that would be OK too, and we can add a query to look for files derived from A, and files derived from files derived from A and so on. It is relatively easy for us to look for things that are referenced, but prohibitively expensive to look for things that aren't.

hannes-ucsc commented 1 year ago

If that works for you, I'll get a PR up in the anvil_tdr_ingest repo.

It does. Thank you.

hannes-ucsc commented 1 year ago

Schema PR is here: https://github.com/broadinstitute/anvil_tdr_ingest/pull/3

hannes-ucsc commented 1 year ago

Schema PR was merged and updated snapshots were released. The work to switch to those snapshots is tracked in #5014. Adding support for the new is_supplementary column is tracked in #5000.

hannes-ucsc commented 1 year ago

We found a bug in the 1000G snapshot in TDR dev (currently indexed in anvilprod). An updated snapshot was released and we are currently working on incorporating it and removing our workaround for the bug.

hannes-ucsc commented 10 months ago

The supplementary file schema change and our support for it are fully implemented. The 1000G bug mentioned in the previous comment was addressed and our workaround for it was removed.