EDRN / CancerDataExpo

Buildout for the EDRN backend data application server we affectionately call the CancerDataExpo
https://edrn.jpl.nasa.gov/cancerdataexpo
Apache License 2.0

LabCAS RDF should have public data only #15

Closed by nutjob4life 2 years ago

nutjob4life commented 2 years ago

As the title says: LabCAS RDF should have public data only. Currently it has everything.

nutjob4life commented 2 years ago

@asitang here's what the RDF generator does to get all collections:

solr = Solr(context.labcasSolrURL + '/collections', auth=(context.username, context.password))
results = solr.search(q='*:*', rows=999999)  # 😮 TODO This'll fail once we get to a million collections

What should the q be to get just public collections only?
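On the TODO above: one way to avoid depending on a single huge `rows` value is to page through the results. This is only a minimal sketch, not what the RDF generator does. It assumes the client's `search` accepts `start` and `rows` keyword arguments and returns an iterable of documents (pysolr's `Solr.search` forwards extra keyword arguments to Solr as query parameters). The names `iter_all_docs` and `page_size` are hypothetical.

```python
def iter_all_docs(search, q='*:*', page_size=500):
    """Yield every document matching q, fetching page_size rows per request.

    `search` is any callable shaped like pysolr's Solr.search: it takes q plus
    start/rows keyword arguments and returns an iterable of documents.
    """
    start = 0
    while True:
        page = list(search(q=q, start=start, rows=page_size))
        if not page:
            return  # nothing left to fetch
        yield from page
        if len(page) < page_size:
            return  # a short page means this was the last one
        start += page_size
```

With the `solr` object from the snippet above, usage would look like `for doc in iter_all_docs(solr.search): ...`. Offset paging gets slower as `start` grows, so for very large indexes Solr's `cursorMark` paging is the usual alternative.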

asitang commented 2 years ago

@nutjob4life this is a good question for @yuliujpl as he already uses some logic (probably using the "OwnerPrincipal" field)!

nutjob4life commented 2 years ago

@yuliujpl if you have a moment, could you look at the above ↑ What should q be?

yuliujpl commented 2 years ago

@nutjob4life, I would use "QAState":"Public"! However, not all collections have QAState on them so @asitang @hoodriverheather we may need to make sure all collections have QAState? Thanks!

hoodriverheather commented 2 years ago

@nutjob4life @yuliujpl We stopped using QAState and now only use the OwnerPrincipal.

OwnerPrincipal=cn=All Users,dc=edrn,dc=jpl,dc=nasa,dc=gov
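To turn that into a value for `q`, something like the following might work. It is a sketch under assumptions: that `OwnerPrincipal` is indexed as an exact (untokenized) string in the LabCAS Solr schema, which this thread does not confirm, and that a quoted phrase query is the right way to match it. The helper name `owner_query` is hypothetical.

```python
# Principal value quoted in this thread for publicly visible collections
PUBLIC_PRINCIPAL = 'cn=All Users,dc=edrn,dc=jpl,dc=nasa,dc=gov'


def owner_query(principal=PUBLIC_PRINCIPAL, field='OwnerPrincipal'):
    """Build a Solr phrase query matching documents whose field holds principal."""
    # Inside a quoted phrase, backslashes and double quotes must be escaped
    escaped = principal.replace('\\', '\\\\').replace('"', '\\"')
    return f'{field}:"{escaped}"'
```

The result of `owner_query()` could then be passed as `q` in place of `'*:*'` in the `solr.search` call quoted earlier.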

nutjob4life commented 2 years ago

@hoodriverheather at our status meeting on 2022-04-26, I heard that we wanted https://edrn.nci.nih.gov/data-and-resources/data to have public data only. (This'll require just an update to the LabCAS RDF.)

However, on 2022-05-03, I think I heard a different requirement: that https://edrn.nci.nih.gov/data-and-resources/data should show public data unless you're logged in and have permission to view additional collections. Is that correct? (This'll require both updates to LabCAS RDF and to the Public Portal—which is fine, since we're updating the Public Portal anyway).

Let me know which way to proceed. The former is quicker but the latter is nicer 😇

Incidentally, I looked at the collections in edrn-labcas and found the following:

| Collection | Num Owners | Num QA States |
|---|---|---|
| Analysis_of_pancreatic_cancer_biomarkers_in_PLCO_set | 1 | 0 |
| Autoantibody_Biomarkers | 1 | 0 |
| Automated_Quantitative_Measures_of_Breast_Density_Data | 1 | 0 |
| Automated_System_For_Breast_Cancer_Biomarker_Analysis | 1 | 0 |
| Barrett's_Esophagus_Methylation_Profile_Dataset | 3 | 0 |
| Basophile | 1 | 0 |
| BBD_Pathology_Slide_Images | 4 | 0 |
| Canary_Never-smoker_lung_adenocarcinoma | 8 | 0 |
| Combined_Imaging_and_Blood_Biomarkers_for_Breast_Cancer_Diagnosis | 4 | 0 |
| DCIS_Pathology_Slide_Images | 4 | 0 |
| Duke_University_Breast_Data | 3 | 0 |
| EDRN_Prostate_Data_University_of_Washington | 1 | 0 |
| EDRN_WHI_Colon | 1 | 1 |
| EVMS_Mass_Spec_Data | 3 | 0 |
| FHCRC_MALDI_Dilution_Processed_Data | 2 | 0 |
| GSTP1_Methylation | 1 | 0 |
| Lung_Team_Project_2 | 16 | 0 |
| Moffitt_Holgic_Dimensions_3D_Case-Control_Mammography_Study | 1 | 0 |
| Multiplex_IF_Staining_Pancreatic_Cancer | 1 | 0 |
| nanoString_multi-marker_RNA_digital_counts | 1 | 0 |
| NIST_Fish_Data | 3 | 0 |
| PLCO_Phase_III_Dataset | 1 | 0 |
| Pre-PLCO_Phase_II_Dataset | 1 | 0 |
| Prostate_MRI | 3 | 0 |
| "Prostate_pre-validation_for_hk2, _hk4_and_hk11." | 3 | 0 |
| Reproducibility_of_miRNA_Measurements | 1 | 1 |
| Retrospective_Images_and_Blood_Duke | 3 | 0 |
| Retrospective_Images_and_Blood_Moffitt | 3 | 0 |
| SELDI_PhaseII | 1 | 0 |
| Transcriptomes_of_human_bladder_cells_and_cells_in_bladder_cancer | 1 | 0 |
| TSP_Pre-validation_using_Prostate_Rapid_Pre-Validation_Set. | 3 | 0 |
| University_of_Pittsburg_Ovarian_Data | 3 | 0 |
| University_of_Pittsburg_Pancreatic_Data | 3 | 0 |
| University_of_Washington_Immunohistochemistry_Data | 1 | 0 |
| University_of_Washington_Microarray_Data | 1 | 0 |

So it definitely looks like OwnerPrincipal is the one to look at 😉
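For anyone wanting to reproduce a tally like the one above, here is a rough sketch of how such counts could be computed from collection documents. It is not the script actually used. It assumes each document is a dict and that the collection name lives in a field called `CollectionName`, which is a guess; `as_list` and `count_fields` are hypothetical helper names.

```python
def as_list(value):
    """Normalize a possibly missing or single-valued field to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def count_fields(docs, name_field='CollectionName'):
    """Return (name, num_owners, num_qa_states) tuples sorted by name.

    The sort ignores case, matching the ordering of the table above.
    """
    rows = [
        (
            doc[name_field],
            len(as_list(doc.get('OwnerPrincipal'))),
            len(as_list(doc.get('QAState'))),
        )
        for doc in docs
    ]
    return sorted(rows, key=lambda row: row[0].lower())
```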

However for FHCRC_MALDI_Dilution_Processed_Data there seems to be an incorrect value for OwnerPrincipal; it has two values:

I've submitted a separate issue for that.

nutjob4life commented 2 years ago

bae6b00bb0733daec927f08b989d62126e3a4f16 goes ahead and adds owner principal to the RDF so the portal will have to filter. Closing this.
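For the portal side, the filtering could look roughly like the sketch below. It is only an illustration of the "public unless you're logged in with extra permissions" rule discussed above, not the portal's actual code. It assumes the portal ends up with a mapping of collection name to owner principals and knows the group DNs of the current user; `visible_collections` and `user_groups` are hypothetical names.

```python
# Principal value quoted in this thread for publicly visible collections
PUBLIC_PRINCIPAL = 'cn=All Users,dc=edrn,dc=jpl,dc=nasa,dc=gov'


def visible_collections(collections, user_groups=()):
    """Return names of collections the user may see, in mapping order.

    collections: mapping of collection name -> iterable of owner principals.
    user_groups: DNs of groups the user belongs to; empty for anonymous users.
    """
    # DNs are compared as plain strings here; a real implementation would
    # probably normalize case and whitespace first.
    allowed = {PUBLIC_PRINCIPAL, *user_groups}
    return [
        name for name, owners in collections.items()
        if allowed & set(owners)
    ]
```

Anonymous visitors would call `visible_collections(collections)` and get only collections owned by the "All Users" principal; logged-in users would pass their group DNs to see the additional collections they own.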