ARGA-Genomes / arga-data

ARGA
Mozilla Public License 2.0

BPA Data ingest #10

Open nickdos opened 2 years ago

nickdos commented 2 years ago

API documentation: https://docs.ckan.org/en/latest/api/
Full package list from API: https://data.bioplatforms.com/api/3/action/package_list
Tag list: https://data.bioplatforms.com/api/3/action/tag_list
Schema doc: https://docs.google.com/spreadsheets/d/1wgNknGPWhlyLU6BBpB2eV2T0eoyTrFf_/edit#gid=1783503621

To get an individual record, use a name from the package_list output as the ID, e.g. https://data.bioplatforms.com/api/3/action/package_show?id=bpa-tsi-pacbio-hifi-357366-da052899
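A minimal sketch of the two calls above, using only the standard library. The helper names `action_url` and `call_action` are hypothetical; the `success`/`result` response envelope is the standard CKAN action API shape. Access rules for non-public datasets are not handled here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://data.bioplatforms.com/api/3/action"


def action_url(action, **params):
    """Build a CKAN action URL, e.g. .../package_show?id=<name>."""
    query = urlencode(params)
    return f"{BASE_URL}/{action}?{query}" if query else f"{BASE_URL}/{action}"


def call_action(action, **params):
    """Call a CKAN action over HTTP and return the "result" part of the response."""
    with urlopen(action_url(action, **params)) as response:
        payload = json.load(response)
    if not payload.get("success"):
        raise RuntimeError(f"CKAN action {action} failed: {payload.get('error')}")
    return payload["result"]


# Example usage (performs network requests, so left commented out):
# names = call_action("package_list")              # list of dataset names
# record = call_action("package_show", id=names[0])
# print(record["name"], record.get("num_resources"))
```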

https://usersupport.bioplatforms.com/programmatic_access.html

https://data.bioplatforms.com/api/3/action/package_search?q=:&rows=1000 seems to work. Results need to be paginated at 1000 records (hard limit). 48,438 records found, the same number reported on https://data.bioplatforms.com/dataset. Help doc is linked above.
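A rough sketch of how the pagination could work, assuming the standard CKAN `start` and `rows` parameters on package_search and the usual `result.count` / `result.results` response fields. It omits `q` and relies on the CKAN default of matching everything, which is an assumption rather than something tested against this site. `fetch_page` and `iter_packages` are hypothetical names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_URL = "https://data.bioplatforms.com/api/3/action/package_search"
PAGE_SIZE = 1000  # hard limit noted above


def fetch_page(start, rows=PAGE_SIZE):
    """Fetch one page of package_search results (network call)."""
    url = f"{SEARCH_URL}?{urlencode({'rows': rows, 'start': start})}"
    with urlopen(url) as response:
        return json.load(response)["result"]


def iter_packages(fetch=fetch_page, rows=PAGE_SIZE):
    """Yield every package record, requesting one page at a time."""
    start = 0
    while True:
        result = fetch(start, rows)
        page = result["results"]
        yield from page
        start += len(page)
        # Stop on an empty page or once the reported total has been reached.
        if not page or start >= result["count"]:
            break
```

With 48,438 records and 1000 per page this would make 49 requests. The `fetch` argument is there so the loop can be exercised without hitting the API.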


Just creating a placeholder for noting down snippets of info.

Downloading a single dataset results in a ZIP file with the following structure - e.g. from https://data.bioplatforms.com/dataset/bpa-tsi-illumina-shortread-357596-h3vnhdsx2

[image: structure of the downloaded ZIP file]

The package_metadata directory contains a CSV file with the following column headings:

Organization | Title | Description | URL | Tags | Geospatial Coverage | License | Resource Permissions | Access Control Reason | Access Control Date | Access Control Mode | Sequence Data Type | Related Data | analysis_software | analysis_software_version | base_url | ccg_jira_ticket | cell_postion | data_context | data_custodian | data_generated | data_type | dataset_id | date_of_transfer | date_of_transfer_to_archive | decimal_latitude_public | decimal_longitude_public | description | dna_treatment | download | experimental_design | facility_project_code | facility_sample_id | file_count | file_name | file_type | flow_cell_id | flowcell_id | flowcell_type | folder_name | genus | insert_size_range | latitude | library_comments | library_construction_protocol | library_id | library_index_id | library_index_sequence | library_layout | library_location | library_ng_ul | library_oligo_sequence | library_pcr_cycles | library_pcr_reps | library_prep_date | library_prepared_by | library_selection | library_source | library_status | library_strategy | library_type | longitude | movie_length | n_libraries_pooled | name | notes | run_format | sample_id | sequencing_facility | sequencing_kit_chemistry_version | sequencing_model | sequencing_platform | species | specimen_id | ticket | tissue_number | work_order
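For reading these CSVs straight out of the downloaded ZIP, something like the following might do. The file names inside the ZIP are not recorded here, so this sketch just takes the first `.csv` found under the named directory; `read_metadata_rows` is a hypothetical helper, and UTF-8 encoding is an assumption.

```python
import csv
import io
import zipfile


def read_metadata_rows(zip_source, directory="package_metadata"):
    """Return rows (as dicts keyed by column heading) from the first CSV under `directory`."""
    with zipfile.ZipFile(zip_source) as archive:
        for member in archive.namelist():
            parts = member.split("/")
            # Match the directory at any depth, since the ZIP may have a top-level folder.
            if directory in parts[:-1] and member.lower().endswith(".csv"):
                with archive.open(member) as raw:
                    # utf-8-sig also tolerates a leading byte-order mark (encoding is assumed).
                    text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
                    return list(csv.DictReader(text))
    raise FileNotFoundError(f"no CSV found under {directory!r}")
```

The same function can be pointed at `resource_metadata` by passing a different `directory`.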

The organization_metadata directory contains a CSV file with this data:

Field | Value
name | threatened-species
display_name | Threatened Species Initiative
info_url | https://bioplatforms.com/projects/threatened-species/
methods_url | https://threatenedspeciesinitiative.com/

The resource_metadata directory contains a CSV with data:

Name | Description | Data File | MD5 | SHA256 | File size (bytes) | S3 E-Tag (8MB multipart) | S3 E-Tag (16MB multipart) | S3 E-Tag (32MB multipart) | S3 E-Tag (64MB multipart) | S3 E-Tag (128MB multipart) | S3 E-Tag Verified At | Format | facility_id | flow_cell_id | index | lane | library_id | read | resource_type
357596_TSI_AGRF_H3VNHDSX2_TAACCGGT-ATCGTCTC_L004_R1.fastq.gz |  | https://data.bioplatforms.com/dataset/bpa-tsi-illumina-shortread-357596-h3vnhdsx2/resource/8da47fd53def7c565258307dcc0db194/download/357596_TSI_AGRF_H3VNHDSX2_TAACCGGT-ATCGTCTC_L004_R1.fastq.gz | 8da47fd53def7c565258307dcc0db194 |  | 8491754762 |  |  |  |  |  |  | FASTQ | AGRF | H3VNHDSX2 | TAACCGGT-ATCGTCTC | L004 | 102.100.100/357596 | R1 | tsi-illumina-shortread
357596_TSI_AGRF_H3VNHDSX2_TAACCGGT-ATCGTCTC_L004_R2.fastq.gz |  | https://data.bioplatforms.com/dataset/bpa-tsi-illumina-shortread-357596-h3vnhdsx2/resource/1100415019be8598940b226b00e1db8c/download/357596_TSI_AGRF_H3VNHDSX2_TAACCGGT-ATCGTCTC_L004_R2.fastq.gz | 1100415019be8598940b226b00e1db8c |  | 8838742827 |  |  |  |  |  |  | FASTQ | AGRF | H3VNHDSX2 | TAACCGGT-ATCGTCTC | L004 | 102.100.100/357596 | R2 | tsi-illumina-shortread
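
Since the resource CSV carries an MD5 per file, a downloaded file could be checked against it roughly like this. A sketch only: `md5_of_file` and `matches_md5` are hypothetical helpers, and the files are read in chunks because they are several GB each.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file without loading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_md5):
    """Compare a file's MD5 with the value from the MD5 column (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```
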
nickdos commented 2 years ago

Dumping the full list of fields sorted by most populated, extracted from notebook:

field count of rows with values
organization.approval_status 48438
revision_id 48438
organization.type 48438
organization.revision_id 48438
organization.image_url 48438
organization.state 48438
organization.is_organization 48438
organization.name 48438
num_resources 48438
organization.title 48438
groups 48438
organization.created 48438
relationships_as_subject 48438
tags 48438
organization.description 48438
metadata_modified 48438
creator_user_id 48438
id 48438
private 48438
metadata_created 48438
num_tags 48438
relationships_as_object 48438
type 48438
organization.id 48438
state 48438
isopen 48438
owner_org 48438
resources 48438
title 48438
name 48438
notes 48438
resource_permissions 48410
sequence_data_type 48406
ticket 48367
license_id 48329
license_title 48329
date_of_transfer 46165
data_type 46162
access_control_date 45927
access_control_mode 45927
access_control_reason 45927
sample_id 45118
folder_name 44402
description 44402
spatial 41229
analysis_software_version 38107
flow_id 36852
amplicon 36480
reads 36166
target 36060
comments 36000
data_generated 34460
latitude 34223
longitude 34215
sample_submission_date 33628
facility 33093
base_url 33010
sample_type 31622
collection_date 31029
vegetation_type 30232
texture 30206
color 30143
url 30139
coastal_id 30090
citation 30084
nrs_trip_code 30061
nrs_sample_code 30061
host_state 30056
touching_organisms 30053
fouling_organisms 30053
information 30053
grazing_number 30053
fire_intensity_if_known 30052
crop_rotation_4yrs_since_present 30052
voyage_code 30052
crop_rotation_3yrs_since_present 30052
funding_agency 30052
utc_time_sampled 30052
imos_site_code 30052
voyage_survey_link 30052
sample_attribution 30052
crop_rotation_5yrs_since_present 30052
date_since_change_in_land_use 30052
flooding 30052
crop_rotation_2yrs_since_present 30052
fire 30052
sample_submitter 30052

top 86 results
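
The notebook code is not reproduced here, but a count like the one above could be produced along these lines. This is an assumed approach, not the notebook's actual method: nested dicts are flattened into dotted names (e.g. `organization.name`), and a field counts as populated when it is neither `None` nor an empty string, so lists count even when empty. `flatten` and `count_populated` are hypothetical names.

```python
from collections import Counter


def flatten(record, prefix=""):
    """Flatten nested dicts into dot-separated keys; lists are kept as plain values."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def count_populated(records):
    """Return (field, count) pairs sorted by how many records have a value."""
    counts = Counter()
    for record in records:
        for name, value in flatten(record).items():
            if value is not None and value != "":
                counts[name] += 1
    return counts.most_common()
```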