bioforensics / pytaxonkit

Python bindings for the TaxonKit library
Other
31 stars 2 forks source link

Bug with taxonomy IDs < 10 not returning names #25

Closed philippbayer closed 2 years ago

philippbayer commented 2 years ago

When running pytaxonkit.name() on a single taxonomy ID below 10, pytaxonkit returns an empty DataFrame.

I'm using pytaxonkit v0.7.2 in Python 3.10.4, both installed via conda with taxonkit v0.7.2.

Example:

# should return Buchnera aphidicola
>>> pytaxonkit.name([9])
Empty DataFrame
Columns: [TaxID, Name]

# should return Bacteria
>>> pytaxonkit.name([2])
Empty DataFrame
Columns: [TaxID, Name]

# taxonomy IDs equal or above 10 work fine
>>> pytaxonkit.name([10])
   TaxID        Name
0     10  Cellvibrio

These IDs work fine in bash:

$ echo 2 | taxonkit lineage --show-name --no-lineage
2       Bacteria

Interestingly, when I add more taxonomy IDs the problem goes away.

>>> pytaxonkit.name([2, 7777])
   TaxID            Name
0      2        Bacteria
1   7777  Chondrichthyes

I am unsure what causes this. My current workaround is to add a taxonomy ID to the list when len(list) == 1, and then remove the last row.

Looking at the code in https://github.com/bioforensics/pytaxonkit/blob/d24a1a5b6b295771c6485e3d5b951a0f66fce957/pytaxonkit.py#L410, it seems to ignore the err variable which is this when the taxonomy ID is < 10:

>>> arglist - 
>>> idlist = '\n'.join(['2'])
>>> proc = Popen(arglist, stdin=PIPE, stdout=PIPE, stderr=PIPE, universal_newlines=True)
>>> out, err = proc.communicate(input=idlist)
>>> err
'11:29:47.253 \x1b[31m[ERRO]\x1b[0m xopen: no content\n'
>>> out
''

Isn't that interesting! I'm unsure how or why this happens.

standage commented 2 years ago

The offending line is here:

https://github.com/bioforensics/pytaxonkit/blob/d24a1a5b6b295771c6485e3d5b951a0f66fce957/pytaxonkit.py#L439

It looks like it's a buffering issue: if there's no newline in the input string, something gets lost in the data transfer and taxonkit doesn't detect the input. Adding a second value fixes the buffering issue, as you discovered. I'm unsure why single taxids with multiple digits work but single digits don't: possibly buffering related as well?

In any case, adding a newline to the end of the input like so seems to fix the problem. PR incoming.

idlist = '\n'.join(map(str, ids)) + '\n'
philippbayer commented 2 years ago

Thanks!!! It might have to do with subprocess' buffering mode, with one number we have only one line - quote


0 means unbuffered (read and write are one system call and can return short)

1 means line buffered (only usable if universal_newlines=True i.e., in a text mode)```

Source

I guess the buffer's default size is larger than whatever space the numbers 1 to 9 take up, and adding a '\n' forces the line buffered mode? Anyways, it works now! Thank you so much, that was very fast

standage commented 2 years ago

Thanks for your patience while this slipped through the cracks for a few days. I've played around a bit with the bufsize parameter and wasn't able to get it to work with single character inputs, so I think we'll stick with the solution implemented in #26.