Open nepsygirl opened 2 years ago
Hi @nepsygirl. Thank you for the question.
Indeed, it occurred to me that this old download script has a certificate problem. I fixed it in my other repo but forgot to update it here. Thanks to your question, I have now fixed it in this repo as well.
As for the "genome": it is KEGG's genome database, so no substitution is necessary. The script downloads the whole genome database from KEGG, including gn:T04342, and puts it into your [genome_output_folder]. For example, I would just run:
python download_kegg_various_databases.py genome /home/dgg32/Downloads/kegg_genome/
It looks like this:
Please try.
Greetings!
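For a single entry such as gn:T04342, one option is KEGG's REST API, whose `get` operation returns the flat-file text of an entry. The sketch below only illustrates that route; it is not taken from download_kegg_various_databases.py, and the helper names `entry_url` and `fetch_entry` are made up for this example.

```python
import urllib.request

# Base address of the public KEGG REST API.
KEGG_REST = "https://rest.kegg.jp"


def entry_url(entry_id):
    """Build the URL of the REST 'get' operation for one entry, e.g. 'gn:T04342'."""
    return f"{KEGG_REST}/get/{entry_id}"


def fetch_entry(entry_id, timeout=30):
    """Download one entry as text (hypothetical helper; needs network access)."""
    with urllib.request.urlopen(entry_url(entry_id), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_entry("gn:T04342")` should return a record in the same flat-file layout as the files the download script writes, which makes it handy for checking one genome without waiting for the full download.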
Thank you for the quick reply. The genome parser is also not working: it does not generate the mapping.csv file. Could you also check it, if possible?
Also, after many attempts, the parser.py script still does not generate the csv files. Could you please fix the bugs? Thank you in advance.
Hi @nepsygirl I tested the new version and it works. Could you please post your error message? Thanks!
(base) pp@PPs-MacBook-Air neo4j_genome_ko % ls -l
total 24
-rw-r--r-- 1 pp staff 1652 Jul 27 08:56 download_kegg_various_databases.py
-rw-r--r-- 1 pp staff 3286 Jul 27 22:06 genome_parser.py
drwxr-xr-x 20114 pp staff 643648 Jul 27 15:58 kegg_databases
-rw-r--r--@ 1 pp staff 2690 Jul 26 20:36 kegg_parser.py
drwxr-xr-x 25258 pp staff 808256 Jul 27 20:37 ko_database
drwxr-xr-x 13 pp staff 416 Jul 26 20:36 neo4j_genome_ko
(base) pp@PPs-MacBook-Air neo4j_genome_ko % python genome_parser.py kegg_databases phylum Proteobacteria
(base) pp@PPs-MacBook-Air neo4j_genome_ko % ls -l
total 40
-rw-r--r-- 1 pp staff 8 Jul 28 10:14 connections.csv
-rw-r--r-- 1 pp staff 1652 Jul 27 08:56 download_kegg_various_databases.py
-rw-r--r-- 1 pp staff 3286 Jul 27 22:06 genome_parser.py
drwxr-xr-x 20114 pp staff 643648 Jul 27 15:58 kegg_databases
-rw-r--r--@ 1 pp staff 2690 Jul 26 20:36 kegg_parser.py
drwxr-xr-x 25258 pp staff 808256 Jul 27 20:37 ko_database
drwxr-xr-x 13 pp staff 416 Jul 26 20:36 neo4j_genome_ko
-rw-r--r-- 1 pp staff 16 Jul 28 10:14 taxon.csv
Well, I downloaded all the databases and ran the genome parser. It executes, but the csv output is broken: it does not contain enough information, so maybe something goes wrong when extracting the data to csv. Also, mapping.csv is missing.
As a matter of fact, connections.csv contains only an empty datasheet:
The same goes for taxon.csv.
Thanks!
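The file sizes in the listing above (8 and 16 bytes) suggest the CSVs hold little more than a header line. A small sketch to confirm that, assuming the first non-blank row of each file is a header; `count_data_rows` is a hypothetical helper, not part of the repo:

```python
import csv


def count_data_rows(path):
    """Return the number of non-blank rows after the (assumed) header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        rows = [row for row in csv.reader(handle)
                if any(cell.strip() for cell in row)]
    # The first non-blank row is treated as the header; an empty file gives 0.
    return max(len(rows) - 1, 0)
```

If `count_data_rows("connections.csv")` and `count_data_rows("taxon.csv")` both return 0, the parser wrote the headers but matched no records, which points at the parsing step rather than the download.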
About prepyphy.py: I wonder why you got the error "table tree already exists". Have you already run prepyphy once before? If so, could you please delete the "ncbi" file and run the command again?
Can you show me what is in your kegg_databases please?
Also, please show me how you ran download_kegg_various_databases.py?
If you have files in the kegg_databases folder, do they look like this?
ENTRY T00004 Complete Genome
ORG_CODE syn
NAME Synechocystis sp. PCC 6803
CATEGORY Reference genome
ANNOTATION yes
TAXONOMY TAX:1148
LINEAGE Bacteria; Cyanobacteria; Synechococcales; Merismopediaceae; Synechocystis
DATA_SOURCE GenBank (Assembly:GCA_000009725.1)
BioProject:60
ORIGINAL_DB CyanoBase
KEYWORDS Photosynthesis
CHROMOSOME Circular
SEQUENCE GB:BA000022
LENGTH 3573470
PLASMID pSYSA; Circular
SEQUENCE GB:AP004311
LENGTH 103307
PLASMID pSYSG; Circular
SEQUENCE GB:AP004312
LENGTH 44343
PLASMID pSYSM; Circular
SEQUENCE GB:AP004310
LENGTH 119895
PLASMID pSYSX; Circular
SEQUENCE GB:AP006585
LENGTH 106004
STATISTICS Number of nucleotides: 3947019
Number of protein genes: 3564
Number of RNA genes: 50
CREATED 1996
REFERENCE PMID:8905231
AUTHORS Kaneko T, Sato S, Kotani H, Tanaka A, Asamizu E, Nakamura Y, Miyajima N, Hirosawa M, Sugiura M, Sasamoto S, et al.
TITLE Sequence analysis of the genome of the unicellular cyanobacterium Synechocystis sp. strain PCC6803. II. Sequence determination of the entire genome and assignment of potential protein-coding regions.
JOURNAL DNA Res 3:109-36 (1996)
DOI:10.1093/dnares/3.3.109
REFERENCE PMID:8590279
AUTHORS Kaneko T, Tanaka A, Sato S, Kotani H, Sazuka T, Miyajima N, Sugiura M, Tabata S
TITLE Sequence analysis of the genome of the unicellular cyanobacterium Synechocystis sp. strain PCC6803. I. Sequence features in the 1 Mb region from map positions 64% to 92% of the genome.
JOURNAL DNA Res 2:153-66, 191-8 (1995)
DOI:10.1093/dnares/2.4.153
///
About prepyphy.py: I wonder why you got the error "table tree already exists". Have you already run prepyphy once before? If so, could you please delete the "ncbi" file and run the command again?
This was just to check whether I had already run it. I also checked the previous issue in this repo and proceeded accordingly.
I removed ncbi and ran it again, and I received the following outputs, which are empty:
Can you show me what is in your kegg_databases please?
Here it is. It looks the same as what you are showing me.
Here is another one:
ENTRY T40296 Viral Genome
NAME Turnip mosaic virus
TAXONOMY TAX:12230
LINEAGE Viruses; Riboviria; Orthornavirae; Pisuviricota; Stelpaviricetes; Patatavirales; Potyviridae; Potyvirus
SEQUENCE RS:NC_002509
COMMENT Plant disease: Mosaic
Host: Brassica juncea (Indian mustard) [TAX:3707], Anemone coronaria [TAX:167998], Brassicaceae (mustard family) [TAX:3700], Lactuca sativa (cultivated lettuce) [TAX:4236]
Vector: Myzus persicae (green peach aphid) [TAX:13164], Brevicoryne brassicae (cabbage aphid) [TAX:69196]
DBLINKS Virus-HostDB: 12230
REFERENCE PMID:11043471
AUTHORS Jenner CE, Sanchez F, Nettleship SB, Foster GD, Ponz F, Walsh JA
TITLE The cylindrical inclusion gene of Turnip mosaic virus encodes a pathogenic determinant to the Brassica resistance gene TuRB01.
JOURNAL Mol Plant Microbe Interact 13:1102-8 (2000)
DOI:10.1094/MPMI.2000.13.10.1102
///
Also, please show me how you ran download_kegg_various_databases.py?
I just used this line to download into the same folder where I had download_kegg_various_databases.py. It eventually downloaded all of them, which took a few hours.
python download_kegg_various_databases.py genome kegg_databases
@nepsygirl I see, KEGG has changed its file format, so my parser no longer works. I will write a new version.
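For readers following along, here is a minimal sketch of reading fields such as NAME, TAXONOMY and LINEAGE from one record like those quoted above. It is not the repository's genome_parser.py. It assumes the downloaded files keep each keyword within the first 12 columns, with indented sub-keywords and blank-keyword continuation lines (the indentation is lost in the pasted records in this thread); `parse_kegg_record` and `lineage` are hypothetical names.

```python
KEY_WIDTH = 12  # assumption: keywords occupy the first 12 columns of each line

# A shortened record, re-indented by hand under the assumption above.
SAMPLE = (
    "ENTRY       T40296            Viral Genome\n"
    "NAME        Turnip mosaic virus\n"
    "TAXONOMY    TAX:12230\n"
    "  LINEAGE   Viruses; Riboviria; Orthornavirae\n"
    "///\n"
)


def parse_kegg_record(text):
    """Collect the values of one record into {keyword: [value, ...]}."""
    fields = {}
    key = None
    for line in text.splitlines():
        if line.startswith("///"):  # record terminator
            break
        name = line[:KEY_WIDTH].strip()
        value = line[KEY_WIDTH:].strip()
        if name:
            # A new keyword or indented sub-keyword starts here.
            key = name
            fields.setdefault(key, []).append(value)
        elif key is not None and value:
            # Blank keyword column: continuation of the previous keyword.
            fields[key].append(value)
    return fields


def lineage(fields):
    """Split the LINEAGE value into a list of taxon names."""
    raw = " ".join(fields.get("LINEAGE", []))
    return [part.strip() for part in raw.split(";") if part.strip()]
```

Because a missing keyword simply yields no entry instead of an error, a record without ORG_CODE (like the viral genome above) parses the same way as a complete one.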
Oh thank you so so much.
Sixing Huang @.***> wrote on Fri, 29 July 2022, 10:40:
@nepsygirl I see, KEGG has changed its file format. So my parse no longer works. I will write a new version.
Hi @nepsygirl, I have written a new version. Please pull the newest version and delete your taxon.csv, connections.csv and mapping.tsv. Then run genome_parser again. Example:
python genome_parser.py '/home/huangsixing/Downloads/kegg_genome' superkingdom Archaea
Hope this helps.
Hi @dgg32, it did work, thank you. But it looks like kegg_parser.py needs to be changed too: the names are not being fetched. The name fields are empty, as in the above picture for kegg.csv. I tried to edit the code but have not succeeded so far.
OK. I will work on it on Monday. Please wait.
@nepsygirl Done. Please try with the new version.
Hi,
I tried many names but could not figure out the issue. I thought I had given the wrong genome name. Furthermore, I'd like to know what I should substitute for genome.
python download_kegg_various_databases.py genome [genome_output_folder]
Also, the Python script to download the database did not work. Could you point me to the right way to download it?
I want to download the Proteobacteria KEGG databases, just like you did.
For example, I would like to download the genome from https://www.genome.jp/entry/gn:T04342. What genome name should I enter here?
Thank you!