Genom - Githubissues

nepsygirl commented 2 years ago

Hi,

I tried many different names but could not figure out the issue; I suspect I gave the wrong genome name. I'd like to know what I should substitute for "genome" in the command below.

python download_kegg_various_databases.py genome [genome_output_folder]

Also, the Python script for downloading the database did not work. Could you point me to the right way to download it?

I want to download the Proteobacteria KEGG databases, just like you did.

For example, I would like to download the genome at https://www.genome.jp/entry/gn:T04342. What genome name should I enter here?

Thank you!

dgg32 commented 2 years ago

Hi @nepsygirl. Thank you for the question.

Indeed, I realized that this old download script has a certificate problem. I fixed it in my other repo but forgot to update it in this one. Thanks to your question, I have now fixed it in this repo too.

About "genome": it is the name of KEGG's genome database, so no substitution is necessary. The script downloads the whole genome database from KEGG, including gn:T04342, and puts it into your [genome_output_folder]. So, for example, I would just run:

python download_kegg_various_databases.py genome /home/dgg32/Downloads/kegg_genome/

It looks like this:

Please try.

Greetings!
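
For readers who only need a single entry such as gn:T04342 rather than the whole database, here is a minimal sketch that uses KEGG's public REST interface at rest.kegg.jp. It is not the repository's download script. The function names and the one-file-per-T-number layout are assumptions made for illustration, and the prefix handling in the listing parser is a guess at the response format rather than a documented guarantee.

```python
import urllib.request
from pathlib import Path
from typing import List

KEGG_REST = "https://rest.kegg.jp"  # KEGG's public REST endpoint


def list_url(database: str) -> str:
    """URL that lists every entry of a KEGG database, e.g. 'genome'."""
    return f"{KEGG_REST}/list/{database}"


def entry_url(entry_id: str) -> str:
    """URL that returns one flat-file entry, e.g. 'gn:T04342'."""
    return f"{KEGG_REST}/get/{entry_id}"


def parse_ids(listing: str) -> List[str]:
    """Return the first tab-separated column of each line of a /list response,
    dropping any 'prefix:' in front of the identifier (assumed format)."""
    ids = []
    for line in listing.splitlines():
        if not line.strip():
            continue  # skip blank lines
        first_column = line.split("\t", 1)[0].strip()
        ids.append(first_column.split(":", 1)[-1])
    return ids


def download_entry(t_number: str, out_dir: str) -> Path:
    """Fetch one genome entry and save it as <out_dir>/<t_number>
    (hypothetical layout, not necessarily what the repo script writes)."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(entry_url(f"gn:{t_number}")) as response:
        text = response.read().decode("utf-8")
    path = target / t_number
    path.write_text(text, encoding="utf-8")
    return path


# Example usage (needs network access):
# download_entry("T04342", "kegg_databases")
```

Fetching one entry this way avoids the multi-hour full download when only a handful of genomes are of interest.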

nepsygirl commented 2 years ago

Thank you for the quick reply. The genome parser is also not working: it does not generate the mapping.csv file. Could you check that as well, if possible?

nepsygirl commented 2 years ago

Also, after many attempts, this parser.py script still does not generate the CSV files. Could you please fix the bugs? Thank you in advance.

dgg32 commented 2 years ago

Hi @nepsygirl, I tested the new version and it works. Could you please post your error message? Thanks!

nepsygirl commented 2 years ago

(base) pp@PPs-MacBook-Air neo4j_genome_ko % ls -l
total 24
-rw-r--r--      1 pp  staff    1652 Jul 27 08:56 download_kegg_various_databases.py
-rw-r--r--      1 pp  staff    3286 Jul 27 22:06 genome_parser.py
drwxr-xr-x  20114 pp  staff  643648 Jul 27 15:58 kegg_databases
-rw-r--r--@     1 pp  staff    2690 Jul 26 20:36 kegg_parser.py
drwxr-xr-x  25258 pp  staff  808256 Jul 27 20:37 ko_database
drwxr-xr-x     13 pp  staff     416 Jul 26 20:36 neo4j_genome_ko
(base) pp@PPs-MacBook-Air neo4j_genome_ko % python genome_parser.py kegg_databases phylum Proteobacteria       
(base) pp@PPs-MacBook-Air neo4j_genome_ko % ls -l
total 40
-rw-r--r--      1 pp  staff       8 Jul 28 10:14 connections.csv
-rw-r--r--      1 pp  staff    1652 Jul 27 08:56 download_kegg_various_databases.py
-rw-r--r--      1 pp  staff    3286 Jul 27 22:06 genome_parser.py
drwxr-xr-x  20114 pp  staff  643648 Jul 27 15:58 kegg_databases
-rw-r--r--@     1 pp  staff    2690 Jul 26 20:36 kegg_parser.py
drwxr-xr-x  25258 pp  staff  808256 Jul 27 20:37 ko_database
drwxr-xr-x     13 pp  staff     416 Jul 26 20:36 neo4j_genome_ko
-rw-r--r--      1 pp  staff      16 Jul 28 10:14 taxon.csv

I downloaded all the databases and ran the genome parser. It executes, but the CSV output is broken: it does not contain enough information, so maybe something goes wrong when the data is extracted to CSV. Also, mapping.csv is missing.

As a matter of fact, connections.csv is generated as an empty datasheet:

The same goes for taxon.csv.

Thanks!
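
A quick way to tell whether a generated file holds only a header is to count its data rows. The sketch below is a generic check and not part of the repository. It assumes the first line of each CSV is a header, which the 8-byte and 16-byte file sizes in the listing above suggest but do not prove.

```python
import csv


def count_data_rows(path: str) -> int:
    """Number of non-empty rows after the first (assumed header) row."""
    with open(path, newline="", encoding="utf-8") as handle:
        rows = [row for row in csv.reader(handle) if row]
    return max(len(rows) - 1, 0)


# Example usage:
# for name in ("connections.csv", "taxon.csv"):
#     print(name, count_data_rows(name))
```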

dgg32 commented 2 years ago

About prepyphy.py: I wonder why you got the error "table tree already exists". Have you already run prepyphy once before? If so, could you please delete the "ncbi" file and run the command again?
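
The wording "table tree already exists" matches the error SQLite raises when a CREATE TABLE statement runs against a database that already contains that table. Assuming the "ncbi" file is a SQLite database (the thread does not say so explicitly), the sketch below reproduces the error in memory and shows the IF NOT EXISTS form that tolerates a second run. The column names are invented for illustration.

```python
import sqlite3


def create_tree_twice(use_if_not_exists: bool) -> str:
    """Create a table named 'tree' twice and report what happened."""
    clause = "IF NOT EXISTS " if use_if_not_exists else ""
    # Hypothetical schema; only the table name comes from the error message.
    statement = f"CREATE TABLE {clause}tree (taxid INTEGER, parent INTEGER)"
    connection = sqlite3.connect(":memory:")
    try:
        connection.execute(statement)
        connection.execute(statement)  # second run, as when the script is repeated
        return "ok"
    except sqlite3.OperationalError as error:
        return str(error)
    finally:
        connection.close()


# Example usage:
# print(create_tree_twice(False))
# print(create_tree_twice(True))
```

Deleting the database file before a rerun, as suggested above, has the same effect as starting from an empty database.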

dgg32 commented 2 years ago

Can you show me what is in your kegg_databases please?

dgg32 commented 2 years ago

Also, please show me how you ran download_kegg_various_databases.py?

dgg32 commented 2 years ago

If you have files in the kegg_databases folder, do they look like this?

ENTRY       T00004            Complete  Genome
ORG_CODE    syn
NAME        Synechocystis sp. PCC 6803
CATEGORY    Reference genome
ANNOTATION  yes
TAXONOMY    TAX:1148
  LINEAGE   Bacteria; Cyanobacteria; Synechococcales; Merismopediaceae; Synechocystis
DATA_SOURCE GenBank (Assembly:GCA_000009725.1)
            BioProject:60
ORIGINAL_DB CyanoBase
KEYWORDS    Photosynthesis
CHROMOSOME  Circular
  SEQUENCE  GB:BA000022
  LENGTH    3573470
PLASMID     pSYSA; Circular
  SEQUENCE  GB:AP004311
  LENGTH    103307
PLASMID     pSYSG; Circular
  SEQUENCE  GB:AP004312
  LENGTH    44343
PLASMID     pSYSM; Circular
  SEQUENCE  GB:AP004310
  LENGTH    119895
PLASMID     pSYSX; Circular
  SEQUENCE  GB:AP006585
  LENGTH    106004
STATISTICS  Number of nucleotides:       3947019
            Number of protein genes:        3564
            Number of RNA genes:              50
CREATED     1996
REFERENCE   PMID:8905231
  AUTHORS   Kaneko T, Sato S, Kotani H, Tanaka A, Asamizu E, Nakamura Y, Miyajima N, Hirosawa M, Sugiura M, Sasamoto S, et al.
  TITLE     Sequence analysis of the genome of the unicellular cyanobacterium Synechocystis sp. strain PCC6803. II. Sequence determination of the entire genome and assignment of potential protein-coding regions.
  JOURNAL   DNA Res 3:109-36 (1996)
            DOI:10.1093/dnares/3.3.109
REFERENCE   PMID:8590279
  AUTHORS   Kaneko T, Tanaka A, Sato S, Kotani H, Sazuka T, Miyajima N, Sugiura M, Tabata S
  TITLE     Sequence analysis of the genome of the unicellular cyanobacterium Synechocystis sp. strain PCC6803. I. Sequence features in the 1 Mb region from map positions 64% to 92% of the genome.
  JOURNAL   DNA Res 2:153-66, 191-8 (1995)
            DOI:10.1093/dnares/2.4.153
///
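
To make the structure of such an entry concrete, here is a minimal sketch that pulls the T number, name, NCBI taxonomy ID and lineage out of one record. It is not the repository's genome_parser.py. The function name and the dictionary keys are assumptions, and the code relies only on the layout visible in the entry above: a 12-character key field, with LINEAGE indented under TAXONOMY and "///" closing the record.

```python
from typing import Dict


def parse_genome_entry(text: str) -> Dict[str, object]:
    """Extract a few fields from one KEGG genome flat-file record."""
    record: Dict[str, object] = {"entry": "", "name": "", "taxid": "", "lineage": []}
    for line in text.splitlines():
        # Keys occupy the first 12 columns; the value starts at column 13.
        key, value = line[:12].strip(), line[12:].strip()
        if key == "ENTRY":
            record["entry"] = value.split()[0] if value else ""
        elif key == "NAME":
            record["name"] = value
        elif key == "TAXONOMY":
            record["taxid"] = value.replace("TAX:", "", 1)
        elif key == "LINEAGE":
            record["lineage"] = [part.strip() for part in value.split(";") if part.strip()]
        elif line.startswith("///"):
            break  # end of record
    return record


# Example usage (file name is hypothetical):
# with open("kegg_databases/T00004", encoding="utf-8") as handle:
#     print(parse_genome_entry(handle.read()))
```

The LINEAGE line carries names without rank labels, so matching a name at a specific rank such as phylum would need an extra taxonomy lookup, which this sketch does not attempt.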

nepsygirl commented 2 years ago

> About prepyphy.py: I wonder why you got the error "table tree already exists". Have you already run prepyphy once before? If so, could you please delete the "ncbi" file and run the command again?

That was just a check; I had already run it. I also looked at the previous issue in this repo and followed it.

I removed ncbi and ran it again, and I received the following outputs, which are empty:

> Can you show me what is in your kegg_databases please?

Here it is. It looks the same as what you showed me.

Here is another one:

ENTRY       T40296            Viral     Genome
NAME        Turnip mosaic virus
TAXONOMY    TAX:12230
  LINEAGE   Viruses; Riboviria; Orthornavirae; Pisuviricota; Stelpaviricetes; Patatavirales; Potyviridae; Potyvirus
  SEQUENCE  RS:NC_002509
COMMENT     Plant disease: Mosaic
            Host: Brassica juncea (Indian mustard) [TAX:3707], Anemone coronaria [TAX:167998], Brassicaceae (mustard family) [TAX:3700], Lactuca sativa (cultivated lettuce) [TAX:4236]
            Vector: Myzus persicae (green peach aphid) [TAX:13164], Brevicoryne brassicae (cabbage aphid) [TAX:69196]
DBLINKS     Virus-HostDB: 12230
REFERENCE   PMID:11043471
  AUTHORS   Jenner CE, Sanchez F, Nettleship SB, Foster GD, Ponz F, Walsh JA
  TITLE     The cylindrical inclusion gene of Turnip mosaic virus encodes a pathogenic determinant to the Brassica resistance gene TuRB01.
  JOURNAL   Mol Plant Microbe Interact 13:1102-8 (2000)
            DOI:10.1094/MPMI.2000.13.10.1102
///

> Also, please show me how you ran download_kegg_various_databases.py?

I just used the line below, run in the same folder as download_kegg_various_databases.py. It eventually downloaded all of them, which took a few hours.

python download_kegg_various_databases.py genome kegg_databases

dgg32 commented 2 years ago

@nepsygirl I see, KEGG has changed its file format, so my parser no longer works. I will write a new version.

nepsygirl commented 2 years ago

Oh thank you so so much.

Sixing Huang wrote on Fri, 29 July 2022, 10:40:

> @nepsygirl I see, KEGG has changed its file format, so my parser no longer works. I will write a new version.

dgg32 commented 2 years ago

Hi @nepsygirl, I have written a new version. Please pull the newest version and delete your taxon.csv, connections.csv and mapping.tsv. Then run genome_parser again. Example:

python genome_parser.py '/home/huangsixing/Downloads/kegg_genome' superkingdom Archaea

Hope this helps.

nepsygirl commented 2 years ago

Hi @dgg32, it did work, thank you. But it looks like kegg_parser.py needs to be changed too: the names are not being fetched. The name fields are empty, as in the above picture for kegg.csv. I tried to edit the code but have not succeeded so far.

dgg32 commented 2 years ago

OK. I will work on it on Monday. Please wait.

dgg32 commented 2 years ago

@nepsygirl Done. Please try with the new version.

dgg32 / neo4j_genome_ko

Genom #2