address memory issue with large GISAID files

gregcaporaso commented 3 years ago

This does address the memory issue but is probably not the most elegant approach. Note that explicitly closing fh or using a context manager results in the file being closed too early, so I'm relying on tempfile to ensure that fh is closed. Open to suggestions on how to better address this.

thermokarst commented 3 years ago

Note that explicitly closing fh or using a context manager results in the file being closed too early

Wow, even if you call fh.close() after the skbio.io.read call?

thermokarst commented 3 years ago

PS - this plugin will need a dev-bump in order to get CI passing.

gregcaporaso commented 3 years ago

Wow, even if you call fh.close() after the skbio.io.read call?

Yep, just confirmed again locally. This is one of the four test failures that I get in that case (all result in the same ValueError):

__________ GISAIDDNAFASTAFormatTransformerTests.test_transformer_sequence_exclusion_last_record ___________

self = <genome_sampler.tests.test_transformers.GISAIDDNAFASTAFormatTransformerTests testMethod=test_transformer_sequence_exclusion_last_record>

    def test_transformer_sequence_exclusion_last_record(self):
        input, obs = self.transform_format(GISAIDDNAFASTAFormat,
                                           DNASequencesDirectoryFormat,
>                                          filename='gisaid4.fasta')

genome_sampler/tests/test_transformers.py:79:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../miniconda3/envs/genome-sampler-dev/lib/python3.6/site-packages/qiime2/plugin/testing.py:244: in transform_format
    obs = transformer(input)
genome_sampler/plugin_setup.py:174: in _4
    skbio.io.write(data, format='fasta', into=file)
../../miniconda3/envs/genome-sampler-dev/lib/python3.6/site-packages/skbio/io/registry.py:1166: in write
    return io_registry.write(obj, format, into, **kwargs)
../../miniconda3/envs/genome-sampler-dev/lib/python3.6/site-packages/skbio/io/registry.py:619: in write
    writer(obj, into, **kwargs)
../../miniconda3/envs/genome-sampler-dev/lib/python3.6/site-packages/skbio/io/registry.py:1082: in wrapped_writer
    writer_function(obj, fhs[-1], **kwargs)
../../miniconda3/envs/genome-sampler-dev/lib/python3.6/site-packages/skbio/io/format/fasta.py:774: in _generator_to_fasta
    for header, seq_str, qual_scores in formatted_records:
../../miniconda3/envs/genome-sampler-dev/lib/python3.6/site-packages/skbio/io/format/_base.py:147: in _format_fasta_like_records
    for idx, seq in enumerate(generator):
../../miniconda3/envs/genome-sampler-dev/lib/python3.6/site-packages/skbio/io/registry.py:506: in <genexpr>
    return (x for x in itertools.chain([next(gen)], gen))
../../miniconda3/envs/genome-sampler-dev/lib/python3.6/site-packages/skbio/io/registry.py:531: in _read_gen
    yield from reader(file, **kwargs)
../../miniconda3/envs/genome-sampler-dev/lib/python3.6/site-packages/skbio/io/registry.py:1008: in wrapped_reader
    yield from reader_function(fhs[-1], **kwargs)
../../miniconda3/envs/genome-sampler-dev/lib/python3.6/site-packages/skbio/io/format/fasta.py:675: in _fasta_to_generator
    FASTAFormatError):
../../miniconda3/envs/genome-sampler-dev/lib/python3.6/site-packages/skbio/io/format/fasta.py:853: in _parse_fasta_raw
    for line in _line_generator(fh, skip_blanks=False):
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

fh = <_io.TextIOWrapper name=13 mode='w+' encoding='UTF-8'>, skip_blanks = False, strip = True

    def _line_generator(fh, skip_blanks=False, strip=True):
>       for line in fh:
E       ValueError: I/O operation on closed file.

../../miniconda3/envs/genome-sampler-dev/lib/python3.6/site-packages/skbio/io/format/_base.py:193: ValueError
------------------------------------------ Captured stderr call -------------------------------------------

ebolyen commented 3 years ago

I think that makes sense, as io.read will be returning a generator (that then gets written), so it isn't going to read from the file right away, if close is called, then io.read won't have anything to look at when it is consumed by io.write.

This is all pretty confusing, but I like the temp-file approach. The extra disk-IO is a bummer, but it beats the memory requirements for the original.

The only other alternative would be to create the skbio.DNA objects in the generator (by re-implementing io.read for fasta) which doesn't sound ideal.

thermokarst commented 3 years ago

Oh nice, yeah I didn't realize that was consuming a generator - makes sense!

ebolyen commented 3 years ago

Another approach would be to return the fh from the helper and then close it after io.write in the transformer, then you don't have to "rely" on the garbage collection.

thermokarst commented 3 years ago

So this might be a silly question, and I'm not super familiar with this part of the code in this project, but why not just refactor _cleanup_gen to directly write to a new tempfile, line by line, and then pass that new tempfile into the final skbio read?

diff --git a/genome_sampler/plugin_setup.py b/genome_sampler/plugin_setup.py
index b7f4ec4..9ecb9fe 100644
--- a/genome_sampler/plugin_setup.py
+++ b/genome_sampler/plugin_setup.py
@@ -112,47 +112,49 @@ def _3(fmt: IDSelectionDirFmt) -> IDSelection:

 def _read_gisaid_dna_fasta(path):
-    def _cleanup_gen():
-        with open(path) as input_f:
-            lines = None
-            for line in input_f:
-                if line.startswith('>'):
-                    if lines is not None:
-                        yield from lines
-                    lines = [line]
-                elif lines is not None:
-                    # Due to a bug in skbio 0.5.5, the lowercase option can't
-                    # be used with skbio.io.read for reading DNA sequences.
-                    # Convert sequences to uppercase here.
-                    line = line.upper()
-                    # Spaces and gap characters can appear in unaligned GISAID
-                    # sequence records, so we strip those. U characters are
-                    # additionally replaced with T characters.
-                    line = line.replace('-', '')
-                    line = line.replace('.', '')
-                    line = line.replace(' ', '')
-                    line = line.replace('U', 'T')
-
-                    observed_chars = set(line.strip())
-                    disallowed_chars = observed_chars - skbio.DNA.alphabet
-                    if disallowed_chars:
-                    # Spaces and gap characters can appear in unaligned GISAID
-                    # sequence records, so we strip those. U characters are
-                    # additionally replaced with T characters.
-                    line = line.replace('-', '')
-                    line = line.replace('.', '')
-                    line = line.replace(' ', '')
-                    line = line.replace('U', 'T')
-
-                    observed_chars = set(line.strip())
-                    disallowed_chars = observed_chars - skbio.DNA.alphabet
-                    if disallowed_chars:
-                        print('Note: Non-IUPAC DNA characters (%s) in '
-                              'sequence record %s. This record will be '
-                              'excluded from the output.' %
-                              (' '.join(disallowed_chars),
-                               lines[0][1:].split()[0]),
-                              file=sys.stderr)
-                        lines = None
-                    else:
-                        lines.append(line)
+    with open(path) as input_f, \
+            with tempfile.TemporaryFile(mode='w+') as output_f:
+        lines = None
+        for line in input_f:
+            if line.startswith('>'):
+                if lines is not None:
+                    for l in lines:
+                        output_f.write(l)
+                lines = [line]
+            elif lines is not None:
+                # Due to a bug in skbio 0.5.5, the lowercase option can't
+                # be used with skbio.io.read for reading DNA sequences.
+                # Convert sequences to uppercase here.
+                line = line.upper()
+                # Spaces and gap characters can appear in unaligned GISAID
+                # sequence records, so we strip those. U characters are
+                # additionally replaced with T characters.
+                line = line.replace('-', '')
+                line = line.replace('.', '')
+                line = line.replace(' ', '')
+                line = line.replace('U', 'T')
+
+                observed_chars = set(line.strip())
+                disallowed_chars = observed_chars - skbio.DNA.alphabet
+                if disallowed_chars:
+                    print('Note: Non-IUPAC DNA characters (%s) in '
+                          'sequence record %s. This record will be '
+                          'excluded from the output.' %
+                          (' '.join(disallowed_chars),
+                           lines[0][1:].split()[0]),
+                          file=sys.stderr)
+                    lines = None
                 else:
-                    continue
+                    lines.append(line)
+            else:
+                continue

-            if lines is not None:
-                yield from lines
+        if lines is not None:
+            for l in lines:
+                output_f.write(l)

-    result = skbio.io.read(_cleanup_gen(), verify=False,
-                           format='fasta', constructor=skbio.DNA)
+        result = skbio.io.read(output_f, verify=False,
+                               format='fasta', constructor=skbio.DNA)
     return result

The patch above is untested, but basically just gets rid of the closure and writes each "cleaned up" line to the tempfile. Might be a little easier to reason about in the code, but again, I might be missing something.

gregcaporaso commented 3 years ago

@thermokarst, your suggestion still runs into the file I/O on a closed file (at least if I got it right when testing locally):

def _read_gisaid_dna_fasta(path):
    with open(path) as input_f, \
         tempfile.TemporaryFile(mode='w+') as output_f:
        lines = None
        for line in input_f:
            if line.startswith('>'):
                if lines is not None:
                    output_f.writelines(lines)
                lines = [line]
            elif lines is not None:
                # Due to a bug in skbio 0.5.5, the lowercase option can't
                # be used with skbio.io.read for reading DNA sequences.
                # Convert sequences to uppercase here.
                line = line.upper()
                # Spaces and gap characters can appear in unaligned GISAID
                # sequence records, so we strip those. U characters are
                # additionally replaced with T characters.
                line = line.replace('-', '')
                line = line.replace('.', '')
                line = line.replace(' ', '')
                line = line.replace('U', 'T')

                observed_chars = set(line.strip())
                disallowed_chars = observed_chars - skbio.DNA.alphabet
                if disallowed_chars:
                    print('Note: Non-IUPAC DNA characters (%s) in '
                            'sequence record %s. This record will be '
                            'excluded from the output.' %
                            (' '.join(disallowed_chars),
                            lines[0][1:].split()[0]),
                            file=sys.stderr)
                    lines = None
                else:
                    lines.append(line)
            else:
                continue

        if lines is not None:
            output_f.writelines(lines)

        output_f.seek(0)
        result = skbio.io.read(output_f, verify=False,
                            format='fasta', constructor=skbio.DNA)
    return result

ebolyen commented 3 years ago

That should work if the helper is inlined into the transformer. Then the with statement can have io.write inside of it.

gregcaporaso commented 3 years ago

Solved it in this latest commit in a different way (@ebolyen, I think this is what you were going for when we discussed in Basecamp yesterday). This seems better than my initial version, since the tempfile is being managed with a context manager.

PS - this plugin will need a dev-bump in order to get CI passing.

@thermokarst, are the instructions for doing this the ones in RELEASING.md in this repo?

thermokarst commented 3 years ago

@thermokarst, are the instructions for doing this the ones in RELEASING.md in this repo?

I don't think so - I believe that document is currently focused on discussing cutting a new production release, not cutting a development version. I might need to think about this for a minute.

thermokarst commented 3 years ago

Okay, I think I remember what we need to do, I'll do it right now, and update the RELEASE document.

thermokarst commented 3 years ago

@thermokarst, your suggestion still runs into the file I/O on a closed file (at least if I got it right when testing locally):

I misunderstood where the issue you were running into was occurring - I thought you were bumping into issues inside the helper, not inside the transformer. It all makes sense to me now, sorry for sending you off on a tangent.

I dev bumped the repo, and am waiting on CI to finish. Once green I'll update you here.

thermokarst commented 3 years ago

Okay, dev bump was successful, and CI actions are passing here in this PR. Please note, I pushed up a few commits to your feature branch to get things green.

gregcaporaso commented 3 years ago

Awesome, thanks for the help @thermokarst and @ebolyen!

caporaso-lab / genome-sampler

address memory issue with large GISAID files #94