WeiliWw / VirHostMatcher-Net

VirHostMatcher-Net: A network-based computational tool for predicting virus-host interactions.

VirHostMatcher-Net Panda error #17

Closed miczuppi closed 2 years ago

miczuppi commented 2 years ago

Hi, I have been getting this error:

  File "pandas/_libs/parsers.pyx", line 857, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 843, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1925, in pandas._libs.parsers.raise_parser_error
  pandas.errors.ParserError: Error tokenizing data. C error: Expected 3 fields in line 230132, saw 4

Could you help me fix this?

WeiliWw commented 2 years ago

Could you paste the entire output message and an example of the data you are using?

miczuppi commented 2 years ago

Here is the entire output message:

----Calculating crispr feature values for  combined.fasta  ----
Traceback (most recent call last):
  File "/mnt/projects/miniconda2/envs/virhostmatchernet/VirHostMatcher-Net/VirHostMatcher-Net.py", line 56, in <module>
    predictor = HostPredictor(query_virus_dir, args.short_contig, intermediate_dir, genome_list, args.num_Threads)
  File "/mnt/projects/miniconda2/envs/virhostmatchernet/VirHostMatcher-Net/predictor.py", line 47, in __init__
    self._crispr_signals = src.crispr.crispr_calculator(query_virus_dir, intermediate_dir, numThreads)
  File "/mnt/projects/miniconda2/envs/virhostmatchernet/VirHostMatcher-Net/src/crispr.py", line 82, in crispr_calculator
    ind, df = crisprSingle(item, query_virus_dir, crispr_output_dir, numThreads)
  File "/mnt/projects/miniconda2/envs/virhostmatchernet/VirHostMatcher-Net/src/crispr.py", line 51, in crisprSingle
    query_res = pd.read_table(output_file,header = None)
  File "/mnt/projects/miniconda2/envs/virhostmatchernet/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/mnt/projects/miniconda2/envs/virhostmatchernet/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 683, in read_table
    return _read(filepath_or_buffer, kwds)
  File "/mnt/projects/miniconda2/envs/virhostmatchernet/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 488, in _read
    return parser.read(nrows)
  File "/mnt/projects/miniconda2/envs/virhostmatchernet/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1047, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/mnt/projects/miniconda2/envs/virhostmatchernet/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 223, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 801, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 857, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 843, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1925, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 3 fields in line 230132, saw 4

I have been running VirHostMatcher-Net on a multifasta file:

DF12_D2_k141_1654_1 1-920/2415
CATTATAGGCTGAAAAATCTGTATTGCCCAGACTTGCATTTACCAGACCCGTCCAGACTTCCGAACCGGTTGTTTTTAACAAATAACATATGTAAGTATAAACTATATACACTTAAATTTAAAGTCCAAAAATTTTAATTTCACTGTATACATATATAAATTATCACTATAAAAAAACAAATCCGTATACAAAAAAGAGTATCACTGTTAGATAATTTAAAACAGTTTTACTCTTTTTTATATTTCAAAATATAATTTTTAATCTGATTTCTAAGTCTGTTCCATTTATCCGGGTTGTTCACATAATAAAGCGGGCAGAGTTTACCTGTAACATCATAATGTCTGATAATGTCAGAATTTTCAATTTGGTATTTCGCACACAACCATCCCGCAAGCTTTATAACAGAGTTGTACGTTTTTTTTGAAAATTTACCTGAACTGTCCGGATGACAGCATTCGATAGATATTGTGTCTGAATTTCTATTATTTGAAGCATATGAAATCTCATTTAACGGAATACATTGTATAATAGTTCCGTCCAATCCGATTATAAAATGACTGCTCACTTTATTTGCCGTATCATCAGATAAGTTTTTTCTGCTCTCAAAATAATTTCTGTTTGCCATCGCATCGGTTCCCGGATTTGCCGTATAATGTATAACTATACCTTTTATTTTTCTAAGCTTTATTCCCGGTCTTGAATTTTTATTTACAGTAAGCAATGCCTTTTTCACATTTGGTTTTGGAACGACATACTGTTCATAATTTATTCTGCTGCTCTTGGTGCCGATTTTTTTGGTAATTGCAGATTTTACAAGCATAAATACGACTGTCATTATAATGCATATAACAACAGCCGTACCCCACATTTTTAATACTTTCAATCTGCGGCGACGCTTCGCTTTTGAAAGTTTCTTCAT
DF12_D2_k141_1703_1 1516-1958/1958
GGGTGCGCATCCGGGATGTGGTAGCGCCGGTGTTCTGGCCGGTGCACCGCGCCATTGCCCGCGGCACAGTTCAGGAACTGGTGGCCAAGGGCGGGCGCGGCAGCGGCAAATCCAGCTATATTTCCATTGAGCTTGTTTTGCAGCTGCTGCGCCACCCCGCCTGCCACGCGGTGGTGCTGCGCAAGATCGGCGGCACGCTGCGCACCAGTGTGTATGCGCAGATCCAGTGGGCCATTGGGGCGCTGGGGCTGGCAAAGCAGTTCCGCTGCACCGTCAGCCCCATGGAGTGTACTTATCTGCCCACAGGGCAGAAGATCCTCTTTTTTGGCACCGACGACCCCGGCAAGCTGAAAAGCATCAAGGTGCCATTTGGAGCCATCGGCCTGGCCTGGTTCGAGGAGCTGGACCAGTTCGACGGCCCCGAGGAGGTGCGCAACGTCGAG
DF12_D2_k141_2905_1 1-1607/2573
ACTTTTGTAAGAGATGCACATCCCTTAAATGCATATTTTCCAACTTCCGTCACACTGTACGGAACGGATACATAGGTGATCTTTGTGTTTCCTCTCAGTGCCCCTTCTGCGATCGCCGTTACATTATACGTTATGCCGTTGATCTCCACCTGAGATGGGATGCTCACGGAAGTGATCTCCTCATCCAGCACGCCCGCGTAGGAAACTGCTTTGCTTCTGGTGCTTACAAGATACTTGCCGCCCGTTTTGCTGTCCATTAGAACCGTTCCGGCAGTCGGTGCCACATAGCTGACCGGCAGCATGGTGGTCTGTTTCAGATACGGATAACCGCCGTTTTCTGCTGCATTCAAAGCCCAGATGCCATCGAAATCAAAATCTTTGAAATAGCTCTGTGTCTTGATCTGCACATCATTGAGTGCCGTTGCTGTTCCGGTAATGCAGTTTCCTGTTTCAAATGCATAGACCGGATTCATGCTGTAATAGTAGCTGTTTGAAATGCTGCAGCCGGATGTCTGTGTGCTGCCTGCTGCCATTGTGGACGTGCCGACCATGCCGGATGCCACAGAATAATAAAACTGAATACCGACACATTTGTTGATCACGACCTGACCGTTCCCTGCTGTGATCTTCGCTACGATATTGGCCCCTTTTTCCAATACGCCTGCGTTGTAGCAGTTTGCGATCTCAATGTTTCTTGCCGCTGCCAGATTT

...

WeiliWw commented 2 years ago

Thanks for the information! I assume every header in the multifasta file, such as 'DF12_D2_k141_1654_1 1-920/2415', starts with '>'? Otherwise an error would have been raised early on. If so, please check the following:

  1. I suspect a header formatting issue or bug in the input file. Could you locate the line that triggers the error, line 230132 of $INTERMEDIATE_DIR/CRISPR/combined.crispr, where $INTERMEDIATE_DIR is the directory you passed to the -i option? That line most likely has four fields, and it should show what went wrong.

  2. We generally run VirHostMatcher-Net on a group of separate fasta files, because the final report has one entry per fasta file rather than per contig within it. Running it on combined.fasta gives meaningful results only if you believe all the contigs in combined.fasta come from the same virus.
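
The check in item 1 can be scripted. Below is a minimal sketch, not part of VirHostMatcher-Net, that assumes the .crispr file is tab-separated with three fields per row (as the pandas error message suggests). The function name `find_bad_lines` and its parameters are made up for illustration.

```python
def find_bad_lines(path, expected_fields=3, sep="\t"):
    """Yield (line_number, field_count, text) for rows with an unexpected field count.

    Assumption: the file is tab-separated; blank lines are ignored,
    which matches the pandas default of skipping them.
    """
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line_number, line in enumerate(handle, start=1):
            text = line.rstrip("\r\n")
            if not text.strip():
                continue  # skip blank lines
            field_count = len(text.split(sep))
            if field_count != expected_fields:
                yield line_number, field_count, text


# Example usage (the path is a placeholder for your own intermediate directory):
# for number, count, text in find_bad_lines("intermediate/CRISPR/combined.crispr"):
#     print(number, count, repr(text))
```

Printing the row with `repr` makes the tab characters visible, which helps tell a genuinely extra field from two rows that were written on top of each other.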

miczuppi commented 2 years ago

Thanks for the quick reply. All the headers begin with ">". You were right; these are lines 230131-230133:

DF12_D2_k141_76125      2.640022|GCF_001698755.1|       0.40
DF12_D2_k141_76125      41DF17_D3_k141_93818||full      4.402867|GCF_007674265.1|       0.12
DF17_D3_k141_93818||full        92.1729172|GCF_000300715.1|     0.12

Line 230132 has an extra field, which appears to have caused the error. I am currently running VirHostMatcher-Net on separate fasta files. Should the same problem occur again, I will know how to solve it. Thank you very much for the quick and helpful support.
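
For anyone who wants to do the same, here is a rough sketch of splitting a multifasta file into one fasta file per record. It is not taken from VirHostMatcher-Net; the function name `split_multifasta` and the output naming scheme are assumptions, and it expects record IDs to stay unique after unsafe characters are replaced. Note that splitting this way makes the tool treat every contig as a separate virus.

```python
import os
import re


def split_multifasta(multifasta_path, out_dir):
    """Write each record of a multifasta file to its own file in out_dir.

    Files are named after the first word of the header, with characters
    other than letters, digits, '.', '_' and '-' replaced by '_'.
    Returns the list of written paths in input order.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    seen = set()
    out = None
    try:
        with open(multifasta_path) as handle:
            for line in handle:
                if line.startswith(">"):
                    if out is not None:
                        out.close()
                        out = None
                    words = line[1:].split()
                    record_id = words[0] if words else "unnamed"
                    safe_id = re.sub(r"[^A-Za-z0-9._-]", "_", record_id)
                    if safe_id in seen:
                        raise ValueError(f"duplicate record id: {record_id}")
                    seen.add(safe_id)
                    path = os.path.join(out_dir, f"{safe_id}.fasta")
                    out = open(path, "w")
                    written.append(path)
                if out is not None:
                    out.write(line)  # header and sequence lines are copied unchanged
    finally:
        if out is not None:
            out.close()
    return written


# Example usage (paths are placeholders):
# split_multifasta("combined.fasta", "split_fasta")
```

Any text before the first ">" header is ignored, and the original headers are kept intact inside each output file.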

WeiliWw commented 2 years ago

The *.crispr file is the direct output of blastn, so I suspect blastn had a problem writing the file when dealing with large inputs (possibly related to multi-threading). Feel free to reopen this thread if a new issue comes up.
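
If the file really was corrupted while being written, rerunning blastn is the safest fix. As a stopgap, a filtered copy with only the well-formed rows could be produced along these lines. This is an illustrative sketch under the assumption of three tab-separated fields per row; `drop_malformed_rows` is a hypothetical helper, and dropping rows discards the hits they contained.

```python
def drop_malformed_rows(src_path, dst_path, expected_fields=3, sep="\t"):
    """Copy src_path to dst_path, keeping only rows with the expected field count.

    Returns (kept, dropped) so the amount of discarded data is visible.
    Blank lines are skipped and not counted.
    """
    kept = 0
    dropped = 0
    with open(src_path, encoding="utf-8", errors="replace") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue
            if len(line.rstrip("\r\n").split(sep)) == expected_fields:
                dst.write(line)
                kept += 1
            else:
                dropped += 1
    return kept, dropped


# Example usage (paths are placeholders):
# kept, dropped = drop_malformed_rows("combined.crispr", "combined.clean.crispr")
# print(f"kept {kept} rows, dropped {dropped}")
```

pandas 1.3 and later also accept `on_bad_lines="skip"` in `read_table`, but using that here would mean editing src/crispr.py, which has not been tried in this thread.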