Closed by nutjob4life 2 years ago
@asitang here's what the RDF generator does to get all collections:
solr = Solr(context.labcasSolrURL + '/collections', auth=(context.username, context.password))
results = solr.search(q='*:*', rows=999999) # 😮 TODO This'll fail once we get to a million collections
What should `q` be to get public collections only?
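On the `rows=999999` TODO above: one way around the hard cap might be to fetch in pages. This is only a sketch; `iter_all_docs` and `page_size` are made-up names, and it assumes `search` behaves like pysolr's `Solr.search` (takes `q`, `start`, `rows` and returns an iterable of documents) and that Solr returns results in a stable order across requests.

```python
def iter_all_docs(search, q='*:*', page_size=500):
    """Yield every matching document, one page at a time.

    `search` is any callable shaped like pysolr's `Solr.search`: it accepts
    `q` plus `start` and `rows` keyword arguments and returns an iterable
    of documents for that page.
    """
    start = 0
    while True:
        page = list(search(q, start=start, rows=page_size))
        if not page:
            break  # nothing left to fetch
        yield from page
        if len(page) < page_size:
            break  # a short page means this was the last one
        start += page_size
```

With the `solr` object from the snippet above, usage would presumably look like `for doc in iter_all_docs(solr.search): ...`.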
@nutjob4life this is a good question for @yuliujpl as he already uses some logic (probably using the "OwnerPrincipal" field)!
@yuliujpl if you have a moment, could you look at the above ↑ What should `q` be?
@nutjob4life, I would use "QAState":"Public"! However, not all collections have QAState on them so @asitang @hoodriverheather we may need to make sure all collections have QAState? Thanks!
@nutjob4life @yuliujpl We stopped using QAState and now only use the OwnerPrincipal.
OwnerPrincipal=cn=All Users,dc=edrn,dc=jpl,dc=nasa,dc=gov
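For reference, a rough sketch of how that could be turned into a value for `q`. The helper name `owner_query` is hypothetical, and this assumes `OwnerPrincipal` is indexed as an exact-match string field; if it is tokenized, the phrase query may match differently.

```python
# DN quoted in this thread as the owner of public collections.
PUBLIC_OWNER = 'cn=All Users,dc=edrn,dc=jpl,dc=nasa,dc=gov'


def owner_query(dn, field='OwnerPrincipal'):
    """Build a Solr query matching documents whose `field` equals `dn`.

    The DN is wrapped in double quotes so the spaces, commas, and equals
    signs are treated as one phrase; backslashes and embedded double
    quotes are escaped first.
    """
    escaped = dn.replace('\\', '\\\\').replace('"', '\\"')
    return f'{field}:"{escaped}"'
```

The RDF generator could then try something like `solr.search(q=owner_query(PUBLIC_OWNER), rows=...)` in place of `q='*:*'`.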
@hoodriverheather at our status meeting on 2022-04-26, I heard that we wanted https://edrn.nci.nih.gov/data-and-resources/data to have public data only. (This'll require just an update to the LabCAS RDF.)
However, on 2022-05-03, I think I heard a different requirement: that https://edrn.nci.nih.gov/data-and-resources/data should show public data unless you're logged in and have permission to view additional collections. Is that correct? (This'll require both updates to LabCAS RDF and to the Public Portal—which is fine, since we're updating the Public Portal anyway).
Let me know which way to proceed. The former is quicker but the latter is nicer 😇
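To make the second option concrete, here is a minimal sketch of the filtering the portal might do. Everything here is an assumption for illustration: the function name `visible_collections`, the idea that each collection arrives as a dict with an `OwnerPrincipal` list, and that the portal knows the logged-in user's group DNs.

```python
# DN quoted in this thread as the owner of public collections.
PUBLIC_OWNER = 'cn=All Users,dc=edrn,dc=jpl,dc=nasa,dc=gov'


def visible_collections(collections, user_dns=()):
    """Return the collections a visitor may see.

    Anonymous visitors (empty `user_dns`) get only collections owned by
    PUBLIC_OWNER. Logged-in users additionally get any collection that
    lists one of their DNs as an owner. Each collection is assumed to be
    a dict whose 'OwnerPrincipal' value is a list of DN strings.
    """
    allowed = {PUBLIC_OWNER, *user_dns}
    return [
        c for c in collections
        if allowed.intersection(c.get('OwnerPrincipal', []))
    ]
```

The first option (public data only) is the same call with no `user_dns`.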
Incidentally, I looked at the collections in `edrn-labcas` and found the following:
Collection | Num Owners | Num QA States |
---|---|---|
Analysis_of_pancreatic_cancer_biomarkers_in_PLCO_set | 1 | 0 |
Autoantibody_Biomarkers | 1 | 0 |
Automated_Quantitative_Measures_of_Breast_Density_Data | 1 | 0 |
Automated_System_For_Breast_Cancer_Biomarker_Analysis | 1 | 0 |
Barrett's_Esophagus_Methylation_Profile_Dataset | 3 | 0 |
Basophile | 1 | 0 |
BBD_Pathology_Slide_Images | 4 | 0 |
Canary_Never-smoker_lung_adenocarcinoma | 8 | 0 |
Combined_Imaging_and_Blood_Biomarkers_for_Breast_Cancer_Diagnosis | 4 | 0 |
DCIS_Pathology_Slide_Images | 4 | 0 |
Duke_University_Breast_Data | 3 | 0 |
EDRN_Prostate_Data_University_of_Washington | 1 | 0 |
EDRN_WHI_Colon | 1 | 1 |
EVMS_Mass_Spec_Data | 3 | 0 |
FHCRC_MALDI_Dilution_Processed_Data | 2 | 0 |
GSTP1_Methylation | 1 | 0 |
Lung_Team_Project_2 | 16 | 0 |
Moffitt_Holgic_Dimensions_3D_Case-Control_Mammography_Study | 1 | 0 |
Multiplex_IF_Staining_Pancreatic_Cancer | 1 | 0 |
nanoString_multi-marker_RNA_digital_counts | 1 | 0 |
NIST_Fish_Data | 3 | 0 |
PLCO_Phase_III_Dataset | 1 | 0 |
Pre-PLCO_Phase_II_Dataset | 1 | 0 |
Prostate_MRI | 3 | 0 |
"Prostate_pre-validation_for_hk2, _hk4_and_hk11." | 3 | 0 |
Reproducibility_of_miRNA_Measurements | 1 | 1 |
Retrospective_Images_and_Blood_Duke | 3 | 0 |
Retrospective_Images_and_Blood_Moffitt | 3 | 0 |
SELDI_PhaseII | 1 | 0 |
Transcriptomes_of_human_bladder_cells_and_cells_in_bladder_cancer | 1 | 0 |
TSP_Pre-validation_using_Prostate_Rapid_Pre-Validation_Set. | 3 | 0 |
University_of_Pittsburg_Ovarian_Data | 3 | 0 |
University_of_Pittsburg_Pancreatic_Data | 3 | 0 |
University_of_Washington_Immunohistochemistry_Data | 1 | 0 |
University_of_Washington_Microarray_Data | 1 | 0 |
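A tally like the one above could be produced roughly as follows. This is a sketch rather than the script actually used: `summarize` is a hypothetical name, and it assumes each Solr document is a dict with an `id` and optional `OwnerPrincipal` and `QAState` values that may be a single string or a list.

```python
def _as_list(value):
    """Normalize a missing, scalar, or multi-valued field to a list."""
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


def summarize(docs):
    """Return (collection id, owner count, QA state count) per document,
    sorted case-insensitively by collection id."""
    rows = []
    for doc in sorted(docs, key=lambda d: d['id'].lower()):
        owners = _as_list(doc.get('OwnerPrincipal'))
        qa_states = _as_list(doc.get('QAState'))
        rows.append((doc['id'], len(owners), len(qa_states)))
    return rows
```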
So it definitely looks like `OwnerPrincipal` is the one to look at 😉
However, for `FHCRC_MALDI_Dilution_Processed_Data` there seems to be an incorrect value for `OwnerPrincipal`; it has two values:
- `cn=National Cancer Institute,dc=edrn,dc=jpl,dc=nasa,dc=gov` ✅
- `OwnerPrincipal=cn=Feng Fred Hutchinson Cancer Research Center,dc=edrn,dc=jpl,dc=nasa,dc=gov` ❌

I've submitted a separate issue for that.
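A quick check for this kind of bad value might look like the sketch below. It assumes the problem is the stray `OwnerPrincipal=` prefix stored inside the value itself; `malformed_owner_values` is a hypothetical helper, not something in the LabCAS code.

```python
def malformed_owner_values(values, field='OwnerPrincipal'):
    """Return the values that wrongly start with '<field>='.

    A well-formed value is expected to be a bare DN such as
    'cn=...,dc=...'; a value beginning with the field name itself
    suggests the key was copied into the value by mistake.
    """
    prefix = field + '='
    return [v for v in values if v.startswith(prefix)]
```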
bae6b00bb0733daec927f08b989d62126e3a4f16 goes ahead and adds owner principal to the RDF so the portal will have to filter. Closing this.
As the title says: LabCAS RDF should have public data only. Currently it has everything.