Closed gregcaporaso closed 3 years ago
Note that explicitly closing fh or using a context manager results in the file being closed too early
Wow, even if you call fh.close()
after the skbio.io.read
call?
PS - this plugin will need a dev-bump in order to get CI passing.
Wow, even if you call fh.close() after the skbio.io.read call?
Yep, just confirmed again locally. This is one of the four test failures that I get in that case (all result in the same ValueError):
__________ GISAIDDNAFASTAFormatTransformerTests.test_transformer_sequence_exclusion_last_record ___________
self = <genome_sampler.tests.test_transformers.GISAIDDNAFASTAFormatTransformerTests testMethod=test_transformer_sequence_exclusion_last_record>
def test_transformer_sequence_exclusion_last_record(self):
input, obs = self.transform_format(GISAIDDNAFASTAFormat,
DNASequencesDirectoryFormat,
> filename='gisaid4.fasta')
genome_sampler/tests/test_transformers.py:79:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../miniconda3/envs/genome-sampler-dev/lib/python3.6/site-packages/qiime2/plugin/testing.py:244: in transform_format
obs = transformer(input)
genome_sampler/plugin_setup.py:174: in _4
skbio.io.write(data, format='fasta', into=file)
../../miniconda3/envs/genome-sampler-dev/lib/python3.6/site-packages/skbio/io/registry.py:1166: in write
return io_registry.write(obj, format, into, **kwargs)
../../miniconda3/envs/genome-sampler-dev/lib/python3.6/site-packages/skbio/io/registry.py:619: in write
writer(obj, into, **kwargs)
../../miniconda3/envs/genome-sampler-dev/lib/python3.6/site-packages/skbio/io/registry.py:1082: in wrapped_writer
writer_function(obj, fhs[-1], **kwargs)
../../miniconda3/envs/genome-sampler-dev/lib/python3.6/site-packages/skbio/io/format/fasta.py:774: in _generator_to_fasta
for header, seq_str, qual_scores in formatted_records:
../../miniconda3/envs/genome-sampler-dev/lib/python3.6/site-packages/skbio/io/format/_base.py:147: in _format_fasta_like_records
for idx, seq in enumerate(generator):
../../miniconda3/envs/genome-sampler-dev/lib/python3.6/site-packages/skbio/io/registry.py:506: in <genexpr>
return (x for x in itertools.chain([next(gen)], gen))
../../miniconda3/envs/genome-sampler-dev/lib/python3.6/site-packages/skbio/io/registry.py:531: in _read_gen
yield from reader(file, **kwargs)
../../miniconda3/envs/genome-sampler-dev/lib/python3.6/site-packages/skbio/io/registry.py:1008: in wrapped_reader
yield from reader_function(fhs[-1], **kwargs)
../../miniconda3/envs/genome-sampler-dev/lib/python3.6/site-packages/skbio/io/format/fasta.py:675: in _fasta_to_generator
FASTAFormatError):
../../miniconda3/envs/genome-sampler-dev/lib/python3.6/site-packages/skbio/io/format/fasta.py:853: in _parse_fasta_raw
for line in _line_generator(fh, skip_blanks=False):
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
fh = <_io.TextIOWrapper name=13 mode='w+' encoding='UTF-8'>, skip_blanks = False, strip = True
def _line_generator(fh, skip_blanks=False, strip=True):
> for line in fh:
E ValueError: I/O operation on closed file.
../../miniconda3/envs/genome-sampler-dev/lib/python3.6/site-packages/skbio/io/format/_base.py:193: ValueError
------------------------------------------ Captured stderr call -------------------------------------------
I think that makes sense, as io.read
will be returning a generator (that then gets written), so it isn't going to read from the file right away, if close is called, then io.read
won't have anything to look at when it is consumed by io.write
.
This is all pretty confusing, but I like the temp-file approach. The extra disk-IO is a bummer, but it beats the memory requirements for the original.
The only other alternative would be to create the skbio.DNA objects in the generator (by re-implementing io.read
for fasta) which doesn't sound ideal.
Oh nice, yeah I didn't realize that was consuming a generator - makes sense!
Another approach would be to return the fh
from the helper and then close it after io.write
in the transformer, then you don't have to "rely" on the garbage collection.
So this might be a silly question, and I'm not super familiar with this part of the code in this project, but why not just refactor _cleanup_gen
to directly write to a new tempfile, line by line, and then pass that new tempfile into the final skbio read?
diff --git a/genome_sampler/plugin_setup.py b/genome_sampler/plugin_setup.py
index b7f4ec4..9ecb9fe 100644
--- a/genome_sampler/plugin_setup.py
+++ b/genome_sampler/plugin_setup.py
@@ -112,47 +112,49 @@ def _3(fmt: IDSelectionDirFmt) -> IDSelection:
def _read_gisaid_dna_fasta(path):
- def _cleanup_gen():
- with open(path) as input_f:
- lines = None
- for line in input_f:
- if line.startswith('>'):
- if lines is not None:
- yield from lines
- lines = [line]
- elif lines is not None:
- # Due to a bug in skbio 0.5.5, the lowercase option can't
- # be used with skbio.io.read for reading DNA sequences.
- # Convert sequences to uppercase here.
- line = line.upper()
- # Spaces and gap characters can appear in unaligned GISAID
- # sequence records, so we strip those. U characters are
- # additionally replaced with T characters.
- line = line.replace('-', '')
- line = line.replace('.', '')
- line = line.replace(' ', '')
- line = line.replace('U', 'T')
-
- observed_chars = set(line.strip())
- disallowed_chars = observed_chars - skbio.DNA.alphabet
- if disallowed_chars:
- # Spaces and gap characters can appear in unaligned GISAID
- # sequence records, so we strip those. U characters are
- # additionally replaced with T characters.
- line = line.replace('-', '')
- line = line.replace('.', '')
- line = line.replace(' ', '')
- line = line.replace('U', 'T')
-
- observed_chars = set(line.strip())
- disallowed_chars = observed_chars - skbio.DNA.alphabet
- if disallowed_chars:
- print('Note: Non-IUPAC DNA characters (%s) in '
- 'sequence record %s. This record will be '
- 'excluded from the output.' %
- (' '.join(disallowed_chars),
- lines[0][1:].split()[0]),
- file=sys.stderr)
- lines = None
- else:
- lines.append(line)
+ with open(path) as input_f, \
+ with tempfile.TemporaryFile(mode='w+') as output_f:
+ lines = None
+ for line in input_f:
+ if line.startswith('>'):
+ if lines is not None:
+ for l in lines:
+ output_f.write(l)
+ lines = [line]
+ elif lines is not None:
+ # Due to a bug in skbio 0.5.5, the lowercase option can't
+ # be used with skbio.io.read for reading DNA sequences.
+ # Convert sequences to uppercase here.
+ line = line.upper()
+ # Spaces and gap characters can appear in unaligned GISAID
+ # sequence records, so we strip those. U characters are
+ # additionally replaced with T characters.
+ line = line.replace('-', '')
+ line = line.replace('.', '')
+ line = line.replace(' ', '')
+ line = line.replace('U', 'T')
+
+ observed_chars = set(line.strip())
+ disallowed_chars = observed_chars - skbio.DNA.alphabet
+ if disallowed_chars:
+ print('Note: Non-IUPAC DNA characters (%s) in '
+ 'sequence record %s. This record will be '
+ 'excluded from the output.' %
+ (' '.join(disallowed_chars),
+ lines[0][1:].split()[0]),
+ file=sys.stderr)
+ lines = None
else:
- continue
+ lines.append(line)
+ else:
+ continue
- if lines is not None:
- yield from lines
+ if lines is not None:
+ for l in lines:
+ output_f.write(l)
- result = skbio.io.read(_cleanup_gen(), verify=False,
- format='fasta', constructor=skbio.DNA)
+ result = skbio.io.read(output_f, verify=False,
+ format='fasta', constructor=skbio.DNA)
return result
The patch above is untested, but basically just gets rid of the closure and writes each "cleaned up" line to the tempfile. Might be a little easier to reason about in the code, but again, I might be missing something.
@thermokarst, your suggestion still runs into the file I/O on a closed file (at least if I got it right when testing locally):
def _read_gisaid_dna_fasta(path):
with open(path) as input_f, \
tempfile.TemporaryFile(mode='w+') as output_f:
lines = None
for line in input_f:
if line.startswith('>'):
if lines is not None:
output_f.writelines(lines)
lines = [line]
elif lines is not None:
# Due to a bug in skbio 0.5.5, the lowercase option can't
# be used with skbio.io.read for reading DNA sequences.
# Convert sequences to uppercase here.
line = line.upper()
# Spaces and gap characters can appear in unaligned GISAID
# sequence records, so we strip those. U characters are
# additionally replaced with T characters.
line = line.replace('-', '')
line = line.replace('.', '')
line = line.replace(' ', '')
line = line.replace('U', 'T')
observed_chars = set(line.strip())
disallowed_chars = observed_chars - skbio.DNA.alphabet
if disallowed_chars:
print('Note: Non-IUPAC DNA characters (%s) in '
'sequence record %s. This record will be '
'excluded from the output.' %
(' '.join(disallowed_chars),
lines[0][1:].split()[0]),
file=sys.stderr)
lines = None
else:
lines.append(line)
else:
continue
if lines is not None:
output_f.writelines(lines)
output_f.seek(0)
result = skbio.io.read(output_f, verify=False,
format='fasta', constructor=skbio.DNA)
return result
That should work if the helper is inlined into the transformer. Then the with
statement can have io.write
inside of it.
Solved it in this latest commit in a different way (@ebolyen, I think this is what you were going for when we discussed in Basecamp yesterday). This seems better than my initial version, since the tempfile is being managed with a context manager.
PS - this plugin will need a dev-bump in order to get CI passing.
@thermokarst, are the instructions for doing this the ones in RELEASING.md in this repo?
@thermokarst, are the instructions for doing this the ones in RELEASING.md in this repo?
I don't think so - I believe that document is currently focused on discussing cutting a new production release, not cutting a development version. I might need to think about this for a minute.
Okay, I think I remember what we need to do, I'll do it right now, and update the RELEASE document.
@thermokarst, your suggestion still runs into the file I/O on a closed file (at least if I got it right when testing locally):
I misunderstood where the issue you were running into was occurring - I thought you were bumping into issues inside the helper, not inside the transformer. It all makes sense to me now, sorry for sending you off on a tangent.
I dev bumped the repo, and am waiting on CI to finish. Once green I'll update you here.
Okay, dev bump was successful, and CI actions are passing here in this PR. Please note, I pushed up a few commits to your feature branch to get things green.
Awesome, thanks for the help @thermokarst and @ebolyen!
This does address the memory issue but is probably not the most elegant approach. Note that explicitly closing
fh
or using a context manager results in the file being closed too early, so I'm relying ontempfile
to ensure thatfh
is closed. Open to suggestions on how to better address this.