SuLab / scheduled-bots

GeneWiki Scheduled Bots
MIT License

GeneBot_microbes is disabled #59

Closed andrawaag closed 4 years ago

andrawaag commented 4 years ago

The bot does not run. The output suggests that there is a data issue with: ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prokaryotes.txt

pandas.errors.ParserError: Error tokenizing data. C error: Expected 23 fields in line 33502, saw 24

See the console output for details.

andrawaag commented 4 years ago

The bot chokes on an additional "," in the data downloaded from ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prokaryotes.txt. Some rows mention "Biological Resource Center, Korea Research Institute of Bioscience and Biotechnology". The comma between "Center" and "Korea" causes the value to be split into two columns, so the bot sees one more field than expected in those rows and fails.

andrawaag commented 4 years ago

I adapted the bot to ignore the lines in the file that have an additional tab encoded in the data. This is done by adding a parameter to

The bad lines are ignored and not considered by the bot. I wrote a script to identify which lines are ignored in the bot run. To fix this issue, the following lines are reported:

b'Skipping line 33938: expected 23 fields, saw 24\n
Skipping line 34196: expected 23 fields, saw 24\n
Skipping line 34436: expected 23 fields, saw 24\n
Skipping line 34589: expected 23 fields, saw 24\n
Skipping line 34788: expected 23 fields, saw 24\n
Skipping line 35168: expected 23 fields, saw 24\n
Skipping line 41881: expected 23 fields, saw 24\n
Skipping line 45604: expected 23 fields, saw 24\n
Skipping line 46126: expected 23 fields, saw 24\n
Skipping line 57543: expected 23 fields, saw 24\n'
'Skipping line 168308: expected 23 fields, saw 24\n'
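
For readers who want to reproduce a report like the one above without running the bot, here is a minimal sketch of a line checker. It is not the script used in this issue; the function name `find_bad_lines` and the sample data are hypothetical. It assumes the file is tab-separated and that the first line defines the expected number of fields.

```python
import csv
import io
from typing import Iterable, List, Optional, Tuple


def find_bad_lines(
    lines: Iterable[str], delimiter: str = "\t"
) -> Tuple[Optional[int], List[Tuple[int, int]]]:
    """Return (expected_field_count, [(line_number, field_count), ...]).

    The expected count is taken from the first non-empty line (assumed to
    be the header). Line numbers are 1-based physical line numbers.
    """
    # QUOTE_NONE: treat quote characters as ordinary data, so every
    # physical line is exactly one record.
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    expected: Optional[int] = None
    bad: List[Tuple[int, int]] = []
    for row in reader:
        if not row:
            continue  # skip completely empty lines
        if expected is None:
            expected = len(row)  # header defines the expected width
            continue
        if len(row) != expected:
            bad.append((reader.line_num, len(row)))
    return expected, bad


# Hypothetical three-column sample; the last row has a stray tab.
sample = (
    "name\tstatus\tcenter\n"
    "org1\tComplete Genome\tCenter A\n"
    "org2\tContig\tBiological Resource Center,\tNITE\n"
)
expected, bad = find_bad_lines(io.StringIO(sample))
for line_number, count in bad:
    print(f"line {line_number}: expected {expected} fields, saw {count}")
```

To check a downloaded copy of the report, pass an open file handle instead of the `io.StringIO` sample, for example `open(path, newline="", encoding="utf-8")` with `path` pointing at the local file. The line numbers produced this way are counted from the first line of the file and may not match the numbering pandas uses in its own messages.
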
andrawaag commented 4 years ago

Line 33938: "Complete Genome Biological Resource Center, National Institute of Technology and Evaluation (NBRC)"
Line 34196: "FCB group Bacteroidetes/Chlorobi group"
Line 34436: "Complete Genome Biological Resource Center, National Institute of Technology and Evaluation (NBRC)"
Line 34589: "Complete Genome Biological Resource Center, National Institute of Technology and Evaluation (NBRC)"
Line 34788: "Complete Genome Biological Resource Center, National Institute of Technology and Evaluation (NBRC)"
Line 35168: "Contig Biological Resource Center, National Institute of Technology and Evaluation (NBRC)"
Line 41881: "Contig Biological Resource Center, National Institute of Technology and Evaluation (NBRC)"
Line 45604: "Contig Biological Resource Center, National Institute of Technology and Evaluation (NBRC)"
Line 46126: "Complete Genome Biological Resource Center, National Institute of Technology and Evaluation (NBRC)"
Line 57543: "Complete Genome Biological Resource Center, National Institute of Technology and Evaluation (NBRC)"
Line 168308: "Contig Biological Resource Center, National Institute of Technology and Evaluation (NBRC)"

andrewsu commented 4 years ago

@andrawaag from the README at ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS, looks like we should submit that bug report to genomes@ncbi.nlm.nih.gov. LMK if you want me to handle it...

andrawaag commented 4 years ago

I feel a bit silly for not noticing this. I have just sent an email with this issue to that suggested address.

andrawaag commented 4 years ago

The bot is running again