Creating a parasite reference database - Githubissues

EBI-COMMUNITY / ebi-parasite

GNU General Public License v3.0


Creating a parasite reference database #3

Open nimapak opened 6 years ago

nimapak commented 6 years ago

Please search the ENA public databases and find all sequences related to the following two species. At this point we want the assemblies and the read data. List them with all the related accessions (study, run, sample, assembly, experiment, ...) and their URLs in case we want to download them. Please keep the two species in two separate files.

Cryptosporidium parvum
Cryptosporidium hominis

xinliu005 commented 6 years ago

Cryptosporidium_parvum.txt and Cryptosporidium_hominis.txt are attached. Each line holds one fastq file or one primaryacc#.

Cryptosporidium_parvum.txt Cryptosporidium_hominis.txt

xinliu005 commented 6 years ago

The newly created files are attached:

1) Columns: project_id, experiment_id, study_id, analysis_id, sample_id, BIOSAMPLE_ID, assembly_id, gc_id, run_id, submission_id, submission_account_id, data_file_id, fastq_file_path, dataclass, entry_type, dbcode, set_acc, acc_or_prefix, status_id, seq_path.
Rows are sorted first by acc_or_prefix and then by run_id.

2) Accessions that belong to sets were replaced by the set prefix.
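The two-key sort described above can be sketched as follows. This is only an illustration: it assumes the attached files are tab-separated with a header row, which the thread does not state, and the sample rows are made up rather than taken from ENA.

```python
import csv
import io


def sort_rows(tsv_text):
    """Sort rows by acc_or_prefix first, then by run_id (plain string order)."""
    # Assumption: tab-separated text whose first line is the header.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return sorted(reader, key=lambda row: (row["acc_or_prefix"], row["run_id"]))


# Made-up sample rows, not real ENA data.
sample = (
    "acc_or_prefix\trun_id\n"
    "BBB\tRUN2\n"
    "AAA\tRUN9\n"
    "BBB\tRUN1\n"
)

for row in sort_rows(sample):
    print(row["acc_or_prefix"], row["run_id"])
```

Sorting on a tuple key keeps all rows for one accession or prefix together, with that group's runs in order.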

Cryptosporidium_hominis.txt Cryptosporidium_parvum.txt

xinliu005 commented 6 years ago

Just found that GCA_000209695.1 is missing from Cryptosporidium_parvum.txt, so this needs further investigation.
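A check like the one below could catch a missing assembly accession before a file is attached. It is a minimal sketch, assuming a tab-separated file with a header row and a gc_id column; the function name missing_accessions and the file content are hypothetical stand-ins.

```python
import csv
import io


def missing_accessions(expected, tsv_text, column="gc_id"):
    """Return the expected accessions that never appear in the given column."""
    # Assumption: tab-separated text whose first line is the header.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    seen = {row[column] for row in reader}
    return sorted(set(expected) - seen)


# The two accessions are quoted in this thread; the listing itself is made up.
expected = ["GCA_000165345.1", "GCA_000209695.1"]
listing = "gc_id\trun_id\nGCA_000165345.1\tRUN1\n"

print(missing_accessions(expected, listing))
```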

xinliu005 commented 6 years ago

1) The files were recreated, and the accessions and runs are now listed separately in each file.
Accession columns: primaryacc#, statusid, dataclass, entry_type, dbcode, lineage_leaf, tax_id, project_id, study_id, experiment_id, sample_id, biosample_id, gcs_id, assembly_id, analysis_id, seq_url
Run columns: run_id, submission_id, submission_account_id, data_file_id, data_file_path, lineage_leaf, tax_id, project_id, study_id, experiment_id, sample_id, biosample_id, gcs_id, assembly_id, analysis_id, run_url

2) Cryptosporidium_hominis.txt contains: 1770 STD accessions, 11 WGS sets, 68 runs

3) Cryptosporidium_parvum.txt contains: 75059 STD and CON accessions, 11 WGS sets, 144 runs

4) Files attached: Cryptosporidium_hominis.txt Cryptosporidium_parvum.txt
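Per-file tallies like the ones above can be cross-checked by counting rows per dataclass. The sketch below assumes the accession rows have already been parsed into dictionaries keyed by column name; the rows shown are invented for illustration.

```python
from collections import Counter


def count_by_dataclass(rows):
    """Count how many rows fall into each dataclass (STD, CON, WGS, ...)."""
    return Counter(row["dataclass"] for row in rows)


# Invented rows; only the column name 'dataclass' comes from the thread.
rows = [
    {"primaryacc#": "ACC1", "dataclass": "STD"},
    {"primaryacc#": "ACC2", "dataclass": "STD"},
    {"primaryacc#": "ACC3", "dataclass": "WGS"},
]

for dataclass, count in sorted(count_by_dataclass(rows).items()):
    print(dataclass, count)
```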

xinliu005 commented 6 years ago

1) New columns 'SEQ_LEN', 'GENOME_SEQ', and 'CHROMOSOME' were added to the two files. 'SEQ_LEN' is only valid for CON and STD entries; WGS and SET entries are assigned '-1'.

2) For 'Cryptosporidium hominis':
a) 8 chromosome entries ('LN877947' to 'LN877954', CONs) were newly added. Their set_acc is 'ERZ119325', which belongs to 'Cryptosporidium hominis'. Although they are not in the genome_seq table, they appear to be genome sequences.
b) 12 gc_ids found: GCA_000006425.2 GCA_000804495.1 GCA_001305325.1 GCA_001305395.1 GCA_001307845.1 GCA_001483505.1 GCA_001483515.1 GCA_001483535.1 GCA_001593465.1 GCA_001593475.1 GCA_001945495.1 GCA_002223825.1
c) Line3 to Line1772 are short sequences that are not found in the genome_seq table, so they may need to be removed.

3) For 'Cryptosporidium parvum':
a) 12 gc_ids found: GCA_000165345.1 GCA_000209695.1 GCA_001305335.1 GCA_001305415.1 GCA_001305435.1 GCA_001305455.2 GCA_001305475.1 GCA_001306235.1 GCA_001306245.1 GCA_002093595.1 GCA_002093605.1 GCA_002093615.1
b) Line3 to Line75051 are short sequences that are not found in the genome_seq table (except Line10893-Line10896), so they may need to be removed.
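If the short non-genome entries are to be removed, filtering on the GENOME_SEQ column is one option. The sketch below assumes tab-separated input with a header row and that GENOME_SEQ holds 'Y' or 'N' as in the column explanation given in this thread; it also skips the '-1' SEQ_LEN placeholder when summing lengths. The accessions and lengths in the sample are made up.

```python
import csv
import io


def keep_genome_rows(tsv_text):
    """Keep only the rows flagged as belonging to a genome (GENOME_SEQ == 'Y')."""
    # Assumption: tab-separated text whose first line is the header.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if row["GENOME_SEQ"] == "Y"]


def total_length(rows):
    """Sum SEQ_LEN, ignoring the -1 placeholder used for WGS and SET rows."""
    lengths = (int(row["SEQ_LEN"]) for row in rows)
    return sum(length for length in lengths if length >= 0)


# Made-up rows, not real ENA data.
sample = (
    "PRIMARYACC#\tSEQ_LEN\tGENOME_SEQ\tDATACLASS\n"
    "ACC1\t500\tN\tSTD\n"
    "ACC2\t900000\tY\tCON\n"
    "ACC3\t-1\tY\tWGS\n"
)

genome_rows = keep_genome_rows(sample)
print([row["PRIMARYACC#"] for row in genome_rows])
print(total_length(genome_rows))
```

Filtering on the flag rather than on line numbers keeps the step valid if the files are regenerated and the line positions shift.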

Cryptosporidium_parvum.txt Cryptosporidium_hominis.txt

xinliu005 commented 6 years ago

Column name list and explanation:

ACCESSIONS:
PRIMARYACC#: entry ID
SEQ_LEN: number of base pairs
GENOME_SEQ: whether the entry belongs to a genome: 'Y' if yes, 'N' if not
CHROMOSOME: whether the entry belongs to a chromosome: the chromosome number if yes, 'None' if not
DATACLASS: the data class the entry belongs to, such as STD, CON, or WGS
ENTRY_TYPE: a number representing the data class, such as '0' for STD, '1' for CON, '3' for WGS
DBCODE: the data center that created the entry, such as 'D' for DDBJ, 'E' for ENA, and 'G' for NCBI
LINEAGE_LEAF: the genome name
TAX_ID: taxonomy ID
PROJECT_ID: project ID
STUDY_ID: study ID
EXPERIMENT_ID: experiment ID
SAMPLE_ID: secondary accession of the sample
BIOSAMPLE_ID: primary accession of the sample
GCS_ID: genome collection ID
ASSEMBLY_ID: genome assembly ID
ANALYSIS_ID: project data analysis ID
SEQ_URL: the ENA URL for the entry

RUNS:
RUN_ID: run ID
SUBMISSION_ID: run submission ID
SUBMISSION_ACCOUNT_ID: run submission account ID
DATA_FILE_ID: run file ID
DATA_FILE_PATH: run file path
LINEAGE_LEAF: genome name
TAX_ID: taxonomy ID
PROJECT_ID: project ID
STUDY_ID: study ID
EXPERIMENT_ID: experiment ID
SAMPLE_ID: secondary accession of the sample
BIOSAMPLE_ID: primary accession of the sample
GCS_ID: genome collection ID
ASSEMBLY_ID: genome assembly ID
ANALYSIS_ID: project data analysis ID
RUN_URL: the ENA URL for the run
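The coded columns can be turned into readable labels with small lookup tables. The sketch below contains only the codes named in the explanation above ('0', '1', '3' and 'D', 'E', 'G'); any other code is passed through unchanged, since its meaning is not given here. The function name decode and the sample row are hypothetical.

```python
# Only the codes listed in the column explanation; others are left as-is.
ENTRY_TYPES = {"0": "STD", "1": "CON", "3": "WGS"}
DBCODES = {"D": "DDBJ", "E": "ENA", "G": "NCBI"}


def decode(row):
    """Return a copy of the row with ENTRY_TYPE and DBCODE spelled out."""
    decoded = dict(row)
    decoded["ENTRY_TYPE"] = ENTRY_TYPES.get(row["ENTRY_TYPE"], row["ENTRY_TYPE"])
    decoded["DBCODE"] = DBCODES.get(row["DBCODE"], row["DBCODE"])
    return decoded


# Made-up row for illustration.
row = {"PRIMARYACC#": "ACC1", "ENTRY_TYPE": "1", "DBCODE": "E"}
print(decode(row))
```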

xinliu005 commented 6 years ago

Column name list and explanation:

In file assembly_and_annotation.*.txt:
PRIMARYACC#: entry ID
SEQ_LEN: number of base pairs. This applies to all data classes except WGS and SET, whose SEQ_LEN is set to "-1"
CHROMOSOME: whether the entry belongs to a chromosome: the chromosome number if yes, 'None' if not
DATACLASS: the data class the entry belongs to. Values include:
  CON: Entry constructed from segment entry sequences; if unannotated, annotation may be drawn from segment entries
  EST: Expressed Sequence Tag
  GSS: Genome Survey Sequence
  PAT: Patent
  PRT: Patent Proteins
  STD: Standard (all entries not classified as above)
  STS: Sequence Tagged Site
  SET: WGS master
  TSA: Transcriptome Shotgun Assembly
  WGS: Whole Genome Shotgun
ENTRY_TYPE: a number representing the data class, such as '0' for STD, '1' for CON, '3' for WGS
DBCODE: the data center that created the entry, such as 'D' for DDBJ, 'E' for ENA, and 'G' for NCBI
LINEAGE_LEAF: name of the lineage leaf node
TAX_ID: taxonomy ID
PROJECT_ID: project ID
STUDY_ID: study ID
EXPERIMENT_ID: experiment ID
SAMPLE_ID: secondary accession of the sample
BIOSAMPLE_ID: primary accession of the sample
GCS_ID: genome collection ID
ASSEMBLY_ID: genome assembly ID
ANALYSIS_ID: project data analysis ID
SEQ_URL: the ENA URL for the entry

In file reads.*.txt:
RUN_ID: run ID
SUBMISSION_ID: run submission ID
SUBMISSION_ACCOUNT_ID: run submission account ID
DATA_FILE_ID: run file ID
DATA_FILE_PATH: run file path
LINEAGE_LEAF: name of the lineage leaf node
TAX_ID: taxonomy ID
PROJECT_ID: project ID
STUDY_ID: study ID
EXPERIMENT_ID: experiment ID
SAMPLE_ID: secondary accession of the sample
BIOSAMPLE_ID: primary accession of the sample
GCS_ID: genome collection ID
ASSEMBLY_ID: genome assembly ID
ANALYSIS_ID: project data analysis ID
RUN_URL: the ENA URL for the run

xinliu005 commented 6 years ago

assembly_and_annotation.cp.txt reads.cp.txt assembly_and_annotation.ch.txt reads.ch.txt