4dn-dcic / fourfront

Data portal for submitting and viewing genomic data
https://data.4dnucleome.org
MIT License

Home Page QuickInfoBar Links #1886

Closed utku-ozturk closed 5 months ago

utku-ozturk commented 6 months ago

Trello: https://trello.com/c/JngcASPz

utku-ozturk commented 6 months ago

Home page count links.

RahiNav commented 6 months ago

Looks good! For the experiments link, the first column also includes the title, e.g. "multiplexed FISH on ES-E14TG2a with Rad21-AID - 4DNEXRINUWBN". Would it be better to show just the accession "4DNEXRINUWBN", as on other browse pages?

aschroed commented 6 months ago

In re Rahi's comment - the first column is the title, which is returned by default as the first column for all searches. I think it is fine to leave it as it is; no change needed there.

aschroed commented 6 months ago

Utku - I tested the queries you provided above on data, and the Experiment query does appear to return the same number as the count shown at the top of the home page. If you see a discrepancy, can you provide more info so I can look into it?

For the File search: can you adjust the subquery you showed at the meeting this morning to include the 'other_processed_files' linked to experiments and replicate sets? These should be included in the counts in any case. Then check whether there is still a difference between the search count and the aggregated count. Thanks.

utku-ozturk commented 6 months ago

@aschroed Thanks for the feedback. FYI, we adjusted the ES query to include ExpSet and Exp's OPF as below:

    "total_expset_other_processed_files" : {
        "cardinality" : {
            "field" : "embedded.other_processed_files.files.accession.raw",
            "precision_threshold" : 10000
        }
    },
    "total_exp_other_processed_files" : {
        "cardinality" : {
            "field" : "embedded.experiments_in_set.other_processed_files.files.accession.raw",
            "precision_threshold" : 10000
        }
    }

With this change, the total file count on the home page jumped from 39721 to 46051, which exceeds the /search result count of 40932. We are still investigating alternatives to eliminate possible duplicates.
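One possible source of such an overshoot, sketched here with made-up accessions rather than portal data: each cardinality aggregation counts distinct values within its own field only, so a file that appears under more than one field is counted once per field when the results are added together. Whether this is what actually happens in the home page query is an assumption; the sketch only shows the arithmetic.

```python
# Hypothetical accessions, for illustration only (not real portal data).
expset_opf = {"FILE_A", "FILE_B", "FILE_C"}  # OPF linked to replicate sets
exp_opf = {"FILE_C", "FILE_D"}               # OPF linked to experiments


def summed_count(*groups):
    """Add per-field distinct counts, as adding separate cardinality results would."""
    return sum(len(group) for group in groups)


def distinct_count(*groups):
    """Count each accession once across all fields."""
    return len(set().union(*groups))


print(summed_count(expset_opf, exp_opf))    # 5: FILE_C is counted twice
print(distinct_count(expset_opf, exp_opf))  # 4: FILE_C is counted once
```

The gap between the two numbers equals the number of accessions shared across fields, so comparing them is a quick way to check whether overlap alone explains a difference.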

utku-ozturk commented 6 months ago

> This looks good to me. Did Will have a look at the queries to ensure the performance is OK or did you change the approach so that is no longer a concern?

@aschroed We initially tried splitting the ES query into three sub-queries (raw, processed, and OPF), but the memory impact would be a big concern and the counts would still be approximations. We then reverted that change and instead added new aggregations to the existing query, which looks sufficient. The first approach would definitely have required Will's feedback, but in my opinion the current one does not.
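As a rough sketch of the "add aggregations to the existing query" approach: the two OPF aggregations from the earlier comment can sit side by side under one `aggs` key, so a single request returns both counts. The helper `build_count_request` and the surrounding body are hypothetical; the actual portal query is not shown in this thread, and `"size": 0` is only the usual Elasticsearch way to request aggregations without hits.

```python
import json


def build_count_request(extra_aggs):
    """Hypothetical helper: an aggregation-only search body (no hits returned)."""
    return {"size": 0, "aggs": dict(extra_aggs)}


# Aggregation names and fields are copied from the comment above.
opf_aggs = {
    "total_expset_other_processed_files": {
        "cardinality": {
            "field": "embedded.other_processed_files.files.accession.raw",
            "precision_threshold": 10000,
        }
    },
    "total_exp_other_processed_files": {
        "cardinality": {
            "field": "embedded.experiments_in_set.other_processed_files.files.accession.raw",
            "precision_threshold": 10000,
        }
    },
}

body = build_count_request(opf_aggs)
print(json.dumps(body, indent=2))
```

Because both aggregations run over the same set of matched documents in one request, this avoids issuing three separate searches, though each cardinality result remains an approximation.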