BPA Data ingest - Githubissues

API documentation: https://docs.ckan.org/en/latest/api/
Full package list from API: https://data.bioplatforms.com/api/3/action/package_list
Tag list: https://data.bioplatforms.com/api/3/action/tag_list
Schema doc: https://docs.google.com/spreadsheets/d/1wgNknGPWhlyLU6BBpB2eV2T0eoyTrFf_/edit#gid=1783503621

To get an individual record, take a name from the package_list output and pass it as the ID to package_show. E.g. https://data.bioplatforms.com/api/3/action/package_show?id=bpa-tsi-pacbio-hifi-357366-da052899
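
A minimal sketch of the lookup described above, using only the Python standard library. The helper names (`action_url`, `unwrap`, `call_action`) are made up for illustration, and the sketch assumes the endpoints can be read without an API key. It relies on the standard CKAN response envelope, where each action returns a JSON object with `success` and `result` keys.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL taken from the links in this issue.
BASE_URL = "https://data.bioplatforms.com/api/3/action"


def action_url(action, **params):
    """Build the URL for a CKAN action such as package_list or package_show."""
    query = urlencode(params)
    return f"{BASE_URL}/{action}" + (f"?{query}" if query else "")


def unwrap(payload):
    """Return the "result" part of a CKAN response, or raise if the call failed."""
    if not payload.get("success"):
        raise RuntimeError(f"CKAN call failed: {payload.get('error')}")
    return payload["result"]


def call_action(action, **params):
    """Fetch one action over HTTP (assumption: anonymous read access is allowed)."""
    with urlopen(action_url(action, **params)) as response:
        return unwrap(json.load(response))
```

Usage would look something like `names = call_action("package_list")` followed by `record = call_action("package_show", id=names[0])`.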

https://usersupport.bioplatforms.com/programmatic_access.html

https://data.bioplatforms.com/api/3/action/package_search?q=:&rows=1000 seems to work. Results need to be paginated in pages of 1000 records (hard limit). 48,438 records found, the same number reported on https://data.bioplatforms.com/dataset. Help doc linked.
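
The pagination could be sketched roughly as below. `iter_packages` and `search_bpa` are hypothetical names. The sketch assumes the standard CKAN `package_search` parameters `rows` and `start`, and a result containing `count` and `results`. It also assumes the intended query is `*:*` (match everything), which is an assumption and not copied from the URL above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_URL = "https://data.bioplatforms.com/api/3/action/package_search"
PAGE_SIZE = 1000  # hard limit noted above


def search_bpa(start, rows):
    """Fetch one page of results (assumption: q="*:*" matches all packages)."""
    query = urlencode({"q": "*:*", "rows": rows, "start": start})
    with urlopen(f"{SEARCH_URL}?{query}") as response:
        return json.load(response)["result"]


def iter_packages(search=search_bpa, page_size=PAGE_SIZE):
    """Yield every package by moving "start" forward one page at a time.

    `search(start, rows)` must return a dict with "count" (total matches)
    and "results" (the records for this page).
    """
    start = 0
    while True:
        result = search(start, page_size)
        page = result["results"]
        yield from page
        start += len(page)
        # Stop on an empty page or once the reported total has been reached.
        if not page or start >= result["count"]:
            break
```

Calling `records = list(iter_packages())` would then issue one request per 1000 records. Passing the search function in as a parameter keeps the loop testable without network access.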

Just creating a placeholder for noting down snippets of info.

Downloading a single dataset results in a ZIP file with the following structure - e.g. from https://data.bioplatforms.com/dataset/bpa-tsi-illumina-shortread-357596-h3vnhdsx2

The package_metadata directory contains a CSV file with the following column headings:

The organization_metadata directory contains a CSV file with this data:

Field	Value
name	threatened-species
display_name	Threatened Species Initiative
info_url	https://bioplatforms.com/projects/threatened-species/
methods_url	https://threatenedspeciesinitiative.com/

The resource_metadata directory contains a CSV with data:

Name	Description	Data File	MD5	SHA256	File size (bytes)	S3 E-Tag (8MB multipart)	S3 E-Tag (16MB multipart)	S3 E-Tag (32MB multipart)	S3 E-Tag (64MB multipart)	S3 E-Tag (128MB multipart)	S3 E-Tag Verified At	Format	facility_id	flow_cell_id	index	lane	library_id	read	resource_type
357596_TSI_AGRF_H3VNHDSX2_TAACCGGT-ATCGTCTC_L004_R1.fastq.gz		https://data.bioplatforms.com/dataset/bpa-tsi-illumina-shortread-357596-h3vnhdsx2/resource/8da47fd53def7c565258307dcc0db194/download/357596_TSI_AGRF_H3VNHDSX2_TAACCGGT-ATCGTCTC_L004_R1.fastq.gz	8da47fd53def7c565258307dcc0db194		8491754762							FASTQ	AGRF	H3VNHDSX2	TAACCGGT-ATCGTCTC	L004	102.100.100/357596	R1	tsi-illumina-shortread
357596_TSI_AGRF_H3VNHDSX2_TAACCGGT-ATCGTCTC_L004_R2.fastq.gz		https://data.bioplatforms.com/dataset/bpa-tsi-illumina-shortread-357596-h3vnhdsx2/resource/1100415019be8598940b226b00e1db8c/download/357596_TSI_AGRF_H3VNHDSX2_TAACCGGT-ATCGTCTC_L004_R2.fastq.gz	1100415019be8598940b226b00e1db8c		8838742827							FASTQ	AGRF	H3VNHDSX2	TAACCGGT-ATCGTCTC	L004	102.100.100/357596	R2	tsi-illumina-shortread
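
Since the resource metadata carries an MD5 per file, a download could be checked along these lines. This is only a sketch: the function names are invented, the column names `Name` and `MD5` are taken from the header row above, and the delimiter is left as a parameter because the rows are pasted here tab-separated while the file itself is described as a CSV.

```python
import csv
import hashlib
import io


def expected_md5s(metadata_text, delimiter=","):
    """Map each file name to its MD5, read from the resource metadata table."""
    reader = csv.DictReader(io.StringIO(metadata_text), delimiter=delimiter)
    return {row["Name"]: row["MD5"].lower() for row in reader}


def md5_of_stream(stream, chunk_size=1024 * 1024):
    """Compute the MD5 of a binary stream in chunks, so large FASTQ files fit in memory."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_metadata(stream, name, expected):
    """Return True if the stream's MD5 equals the value recorded for `name`."""
    return md5_of_stream(stream) == expected[name]
```

For a real file this would be used as `with open(path, "rb") as fh: ok = matches_metadata(fh, name, expected)`.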

Dumping the full list of fields sorted by most populated, extracted from notebook:

field	count of rows with values
organization.approval_status	48438
revision_id	48438
organization.type	48438
organization.revision_id	48438
organization.image_url	48438
organization.state	48438
organization.is_organization	48438
organization.name	48438
num_resources	48438
organization.title	48438
groups	48438
organization.created	48438
relationships_as_subject	48438
tags	48438
organization.description	48438
metadata_modified	48438
creator_user_id	48438
id	48438
private	48438
metadata_created	48438
num_tags	48438
relationships_as_object	48438
type	48438
organization.id	48438
state	48438
isopen	48438
owner_org	48438
resources	48438
title	48438
name	48438
notes	48438
resource_permissions	48410
sequence_data_type	48406
ticket	48367
license_id	48329
license_title	48329
date_of_transfer	46165
data_type	46162
access_control_date	45927
access_control_mode	45927
access_control_reason	45927
sample_id	45118
folder_name	44402
description	44402
spatial	41229
analysis_software_version	38107
flow_id	36852
amplicon	36480
reads	36166
target	36060
comments	36000
data_generated	34460
latitude	34223
longitude	34215
sample_submission_date	33628
facility	33093
base_url	33010
sample_type	31622
collection_date	31029
vegetation_type	30232
texture	30206
color	30143
url	30139
coastal_id	30090
citation	30084
nrs_trip_code	30061
nrs_sample_code	30061
host_state	30056
touching_organisms	30053
fouling_organisms	30053
information	30053
grazing_number	30053
fire_intensity_if_known	30052
crop_rotation_4yrs_since_present	30052
voyage_code	30052
crop_rotation_3yrs_since_present	30052
funding_agency	30052
utc_time_sampled	30052
imos_site_code	30052
voyage_survey_link	30052
sample_attribution	30052
crop_rotation_5yrs_since_present	30052
date_since_change_in_land_use	30052
flooding	30052
crop_rotation_2yrs_since_present	30052
fire	30052
sample_submitter	30052
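
A count like the one above could be reproduced roughly as follows. This is not the notebook's code, just a standard-library sketch under two assumptions: nested objects are flattened with dotted names (as in `organization.name`), and a field counts as populated when it is present and not null, so empty lists and empty strings still count.

```python
from collections import Counter


def flatten(record, prefix=""):
    """Flatten nested dicts into dotted keys; lists are kept as single values."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def field_counts(records):
    """Return (field, count) pairs sorted from most to least populated."""
    counts = Counter()
    for record in records:
        flat = flatten(record)
        # Assumption: "populated" means the value is not None.
        counts.update(name for name, value in flat.items() if value is not None)
    return counts.most_common()
```

Given the full list of package records, `field_counts(records)` returns pairs that can be written out as the two-column table shown here.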

top 86 results

ARGA-Genomes / arga-data

BPA Data ingest #10