JonnyTran / OpenOmics

A bioinformatics API to interface with public multi-omics bio databases for wicked fast data integration.
https://openomics.readthedocs.io/en/latest/
MIT License

Reviewer 2 - Automated tests - Errors with importing dataset in generate_MiRTarBase test #115

Closed JonnyTran closed 3 years ago

JonnyTran commented 3 years ago

Description

Running MiRTarBase(path="/data/datasets/Bioinformatics_ExternalData/miRTarBase/") in the generate_MiRTarBase test results in an error.

What I Did

Some tests are available and run as part of the Travis CI pipeline, though coverage is limited (49% overall in the run below) and would benefit from additional work. Focusing on test-driven development is a good way to increase coverage. Running the tests locally took a long time (about 11 minutes) and returned various warnings and one error.

☁ OpenOmics [master] python -m pytest --cov=./
============================================================================================================ test session starts =============================================================================================================
platform darwin -- Python 3.8.7, pytest-6.2.2, py-1.10.0, pluggy-0.13.1
rootdir: /Users/stephenmoss/Dropbox/Code/OpenOmics, configfile: setup.cfg
plugins: cov-2.11.1, dash-1.19.0
collected 35 items

tests/test_annotations.py .........                                                                                                                                                                                                    [ 25%]
tests/test_disease.py ..........                                                                               [ 54%]
tests/test_interaction.py .E.....                                                                              [ 74%]
tests/test_multiomics.py ...                                                                                   [ 82%]
tests/test_sequences.py ......                                                                                 [100%]

======================================================= ERRORS =======================================================
______________________________________ ERROR at setup of test_import_MiRTarBase ______________________________________

    @pytest.fixture
    def generate_MiRTarBase():
>       return MiRTarBase(path="/data/datasets/Bioinformatics_ExternalData/miRTarBase/", strip_mirna_name=True,
                          filters={"Species (Target Gene)": "Homo sapiens"})

tests/test_interaction.py:19:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
openomics/database/interaction.py:611: in __init__
    super(MiRTarBase, self).__init__(path=path, file_resources=file_resources,
openomics/database/interaction.py:40: in __init__
    self.validate_file_resources(path, file_resources, verbose=verbose)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <openomics.database.interaction.MiRTarBase object at 0x1aa5036d0>
path = '/data/datasets/Bioinformatics_ExternalData/miRTarBase/'
file_resources = {'miRTarBase_MTI.xlsx': '/data/datasets/Bioinformatics_ExternalData/miRTarBase/miRTarBase_MTI.xlsx'}
npartitions = None, verbose = False

    def validate_file_resources(self, path, file_resources, npartitions=None, verbose=False) -> None:
        """For each file in file_resources, fetch the file if path+file is a URL
        or load from disk if a local path. Additionally unzip or unrar if the
        file is compressed.

        Args:
            path (str): The folder or url path containing the data file
                resources. If url path, the files will be downloaded and cached
                to the user's home folder (at ~/.astropy/).
            file_resources (dict): default None, Used to list required files for
                preprocessing of the database. A dictionary where keys are
                required filenames and value are file paths. If None, then the
                class constructor should automatically build the required file
                resources dict.
            npartitions:
            verbose:
        """
        if validators.url(path):
            for filename, filepath in copy.copy(file_resources).items():
                data_file = get_pkg_data_filename(path, filepath,
                                                  verbose=verbose)  # Download file and replace the file_resource path
                filetype_ext = filetype.guess(data_file)

                # This null if-clause is needed incase when filetype_ext is None, causing the next clause to fail
                if filetype_ext is None:
                    file_resources[filename] = data_file

                elif filetype_ext.extension == 'gz':
                    file_resources[filename] = gzip.open(data_file, 'rt')

                elif filetype_ext.extension == 'zip':
                    zf = zipfile.ZipFile(data_file, 'r')

                    for subfile in zf.infolist():
                        if os.path.splitext(subfile.filename)[-1] == os.path.splitext(filename)[-1]: # If the file extension matches
                            file_resources[filename] = zf.open(subfile.filename, mode='r')

                elif filetype_ext.extension == 'rar':
                    rf = rarfile.RarFile(data_file, 'r')

                    for subfile in rf.infolist():
                        if os.path.splitext(subfile.filename)[-1] == os.path.splitext(filename)[-1]: # If the file extension matches
                            file_resources[filename] = rf.open(subfile.filename, mode='r')
                else:
                    file_resources[filename] = data_file

        elif os.path.isdir(path) and os.path.exists(path):
            for _, filepath in file_resources.items():
                if not os.path.exists(filepath):
                    raise IOError(filepath)
        else:
>           raise IOError(path)
E           OSError: /data/datasets/Bioinformatics_ExternalData/miRTarBase/

openomics/database/base.py:113: OSError
================================================== warnings summary ==================================================
../../../.pyenv/versions/3.8.7/lib/python3.8/site-packages/_pytest/config/__init__.py:1233
  /Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/_pytest/config/__init__.py:1233: PytestConfigWarning: Unknown config option: collect_ignore

    self._warn_or_fail_if_strict(f"Unknown config option: {key}\n")

tests/test_annotations.py: 6 warnings
tests/test_disease.py: 6 warnings
tests/test_interaction.py: 4 warnings
tests/test_multiomics.py: 3 warnings
tests/test_sequences.py: 4 warnings
  /Users/stephenmoss/Dropbox/Code/OpenOmics/openomics/transcriptomics.py:108: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support sep=None with delim_whitespace=False; you can avoid this warning by specifying engine='python'.
    df = pd.read_table(data, sep=None)

tests/test_annotations.py::test_import_GTEx
tests/test_annotations.py::test_GTEx_annotate
  /Users/stephenmoss/Dropbox/Code/OpenOmics/openomics/database/annotation.py:239: FutureWarning: The default value of regex will change from True to False in a future version.
    gene_exp_medians["Name"] = gene_exp_medians["Name"].str.replace("[.].*", "")

tests/test_disease.py::test_import_HMDD
tests/test_disease.py::test_annotate_HMDD
  /Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/encodings/unicode_escape.py:26: DeprecationWarning: invalid escape sequence '\ '
    return codecs.unicode_escape_decode(input, self.errors)[0]

tests/test_disease.py::test_import_HMDD
tests/test_disease.py::test_annotate_HMDD
  /Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/encodings/unicode_escape.py:26: DeprecationWarning: invalid escape sequence '\s'
    return codecs.unicode_escape_decode(input, self.errors)[0]

tests/test_interaction.py::test_import_LncRNA2Target
  /Users/stephenmoss/Dropbox/Code/OpenOmics/openomics/database/interaction.py:476: FutureWarning: Your version of xlrd is 1.2.0. In xlrd >= 2.0, only the xls format is supported. As a result, the openpyxl engine will be used if it is installed and the engine argument is not specified. Install openpyxl instead.
    table = pd.read_excel(file_resources["lncRNA_target_from_low_throughput_experiments.xlsx"])

tests/test_interaction.py::test_import_LncRNA2Target
  /Users/stephenmoss/Dropbox/Code/OpenOmics/openomics/database/interaction.py:480: FutureWarning: The default value of regex will change from True to False in a future version.
    table["Target_official_symbol"] = table["Target_official_symbol"].str.replace("(?i)(mir)", "hsa-mir-")

-- Docs: https://docs.pytest.org/en/stable/warnings.html

---------- coverage: platform darwin, python 3.8.7-final-0 -----------
Name                                       Stmts   Miss  Cover
--------------------------------------------------------------
openomics/__init__.py                         25      7    72%
openomics/clinical.py                         57     19    67%
openomics/database/__init__.py                 7      0   100%
openomics/database/annotation.py             197    110    44%
openomics/database/base.py                   140     37    74%
openomics/database/disease.py                 61      1    98%
openomics/database/interaction.py            330    196    41%
openomics/database/ontology.py               152     93    39%
openomics/database/sequence.py               128     66    48%
openomics/genomics.py                         26      8    69%
openomics/imageomics.py                       63     47    25%
openomics/multicohorts.py                      0      0   100%
openomics/multiomics.py                      111     66    41%
openomics/proteomics.py                       13      2    85%
openomics/transcriptomics.py                 111     32    71%
openomics/utils/GTF.py                        53     53     0%
openomics/utils/__init__.py                    0      0   100%
openomics/utils/df.py                         23     11    52%
openomics/utils/io.py                         40     19    52%
openomics/utils/read_gtf.py                  107     24    78%
openomics/visualization/__init__.py            1      0   100%
openomics/visualization/heatmat.py            11      8    27%
openomics/visualization/umap.py               29     24    17%
openomics_web/__init__.py                      0      0   100%
openomics_web/app.py                          69     69     0%
openomics_web/callbacks.py                     0      0   100%
openomics_web/layouts/__init__.py              0      0   100%
openomics_web/layouts/annotation_view.py       0      0   100%
openomics_web/layouts/app_layout.py            7      7     0%
openomics_web/layouts/clinical_view.py        10     10     0%
openomics_web/layouts/control_tabs.py          5      5     0%
openomics_web/layouts/datatable_view.py       28     28     0%
openomics_web/server.py                        2      2     0%
openomics_web/utils/__init__.py                0      0   100%
openomics_web/utils/io.py                     62     62     0%
openomics_web/utils/str_utils.py              25     25     0%
setup.py                                      44     44     0%
tests/__init__.py                              0      0   100%
tests/data/__init__.py                         0      0   100%
tests/data/test_dask_dataframes.py             0      0   100%
tests/test_annotations.py                     39      0   100%
tests/test_disease.py                         34      0   100%
tests/test_interaction.py                     20      1    95%
tests/test_multiomics.py                      46      1    98%
tests/test_sequences.py                       18      1    94%
--------------------------------------------------------------
TOTAL                                       2094   1078    49%

============================================== short test summary info ===============================================
ERROR tests/test_interaction.py::test_import_MiRTarBase - OSError: /data/datasets/Bioinformatics_ExternalData/miRTa...
================================ 34 passed, 32 warnings, 1 error in 662.47s (0:11:02) ================================

The main error seemed to be a missing dataset. On further inspection of the codebase, it seems that the package is supposed to download the miRTarBase data (although the version 7 release URL appears to be hardcoded when version 8 is now available). I wondered whether this was a permissions issue, with the test being unable to create the /data/datasets/Bioinformatics_ExternalData/miRTarBase/ path on my system. I tried sudo python -m pytest --cov=./ tests/test_interaction.py and got the same error. I then tried sudo mkdir -p /data/datasets/Bioinformatics_ExternalData/miRTarBase beforehand, which returned:

mkdir: /data/datasets/Bioinformatics_ExternalData/miRTarBase: Read-only file system

This is likely due to macOS System Integrity Protection.

However, it seems I am also unable to resolve http://mirtarbase.mbc.nctu.edu.tw/cache/download/7.0/. I believe the URL should actually be http://mirtarbase.cuhk.edu.cn/cache/download/7.0/ (or even http://mirtarbase.cuhk.edu.cn/cache/download/8.0/)? I manually updated to the working version 7.0 release and updated the path in test_interaction.py before running the following:

mkdir -p tests/data/datasets/Bioinformatics_ExternalData/miRTarBase
sudo python -m pytest --cov=./ tests/test_interaction.py

I still received the error, so this needs looking at in more detail.
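For reference, the local-path branch of validate_file_resources shown in the traceback can fail in two different ways: the directory itself is absent, or the directory exists but an expected file is not inside it. Both surface as an OSError carrying only a path. The following is a minimal, standard-library-only sketch for telling the two cases apart; `diagnose_local_resources` is a hypothetical helper written for illustration and is not part of OpenOmics.

```python
import os
import tempfile


def diagnose_local_resources(path, filenames):
    """Return a list of human-readable problems for a local dataset directory.

    Hypothetical helper (not part of OpenOmics). It mirrors the two local
    checks visible in the traceback: the directory must exist, and every
    expected file must exist inside it.
    """
    if not os.path.isdir(path):
        return [f"directory not found: {path}"]
    return [
        f"file not found: {os.path.join(path, name)}"
        for name in filenames
        if not os.path.exists(os.path.join(path, name))
    ]


if __name__ == "__main__":
    expected = ["miRTarBase_MTI.xlsx"]
    with tempfile.TemporaryDirectory() as tmp:
        # An existing but empty directory still fails: the file is missing.
        print(diagnose_local_resources(tmp, expected))
        # Once the file is present, no problems are reported.
        open(os.path.join(tmp, expected[0]), "w").close()
        print(diagnose_local_resources(tmp, expected))
```

If the rerun still fails after creating the directory under tests/data, one possibility worth ruling out is that miRTarBase_MTI.xlsx (the filename listed in file_resources in the traceback) is not present in that directory.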

JonnyTran commented 3 years ago

The path in MiRTarBase(path="/data/datasets/Bioinformatics_ExternalData/miRTarBase/") is hard-coded to a directory on my local computer.

There are many bioinformatics datasets that are not publicly available via direct download (e.g. lncbase) and may require access permission. In that case, users are expected to obtain the data themselves and then provide a path to their own local directory.
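One way a test could accommodate data that users must obtain themselves is to skip, rather than error, when the local files are absent. The sketch below uses only the standard library's unittest; the environment variable name `OPENOMICS_MIRTARBASE_DIR` and the helper `dataset_available` are assumptions made for illustration and do not exist in OpenOmics.

```python
import os
import unittest

# Hypothetical environment variable naming the user's local miRTarBase folder.
MIRTARBASE_DIR = os.environ.get("OPENOMICS_MIRTARBASE_DIR", "")
REQUIRED_FILES = ["miRTarBase_MTI.xlsx"]


def dataset_available(directory, filenames):
    """True only if the directory is set and contains every required file."""
    return bool(directory) and all(
        os.path.isfile(os.path.join(directory, name)) for name in filenames
    )


# The whole test class is skipped when the user has not supplied the data.
@unittest.skipUnless(
    dataset_available(MIRTARBASE_DIR, REQUIRED_FILES),
    "miRTarBase files not available locally",
)
class TestMiRTarBaseFiles(unittest.TestCase):
    def test_required_files_exist(self):
        for name in REQUIRED_FILES:
            self.assertTrue(os.path.isfile(os.path.join(MIRTARBASE_DIR, name)))
```

With pytest, the `pytest.mark.skipif` marker plays the same role, so a test depending on locally obtained data would be reported as skipped on machines that lack it.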

Since it is currently difficult to get a reliable FTP download connection from the server at http://mirtarbase.mbc.nctu.edu.tw, I suggest we remove test_import_MiRTarBase, which relies on FTP download, from the automated test set.

JonnyTran commented 3 years ago

Removed the test_import_MiRTarBase automated test since MiRTarBase is not obtainable via FTP.

gawbul commented 3 years ago

All tests pass, though with a few warnings:

$ OpenOmics [master] python -m pytest
============================================================================================================ test session starts =============================================================================================================
platform darwin -- Python 3.8.9, pytest-6.2.3, py-1.10.0, pluggy-0.13.1
rootdir: /Users/stephenmoss/Dropbox/Code/OpenOmics, configfile: setup.cfg
plugins: dash-1.20.0
collected 38 items

tests/test_annotations.py .........                                                                                                                                                                                                    [ 23%]
tests/test_disease.py ..........                                                                                                                                                                                                       [ 50%]
tests/test_interaction.py .......                                                                                                                                                                                                      [ 68%]
tests/test_multiomics.py ...                                                                                                                                                                                                           [ 76%]
tests/test_sequences.py .........                                                                                                                                                                                                      [100%]

============================================================================================================== warnings summary ==============================================================================================================
../../../.pyenv/versions/3.8.9/lib/python3.8/site-packages/_pytest/config/__init__.py:1233
  /Users/stephenmoss/.pyenv/versions/3.8.9/lib/python3.8/site-packages/_pytest/config/__init__.py:1233: PytestConfigWarning: Unknown config option: collect_ignore

    self._warn_or_fail_if_strict(f"Unknown config option: {key}\n")

openomics/database/ontology.py:235
  /Users/stephenmoss/Dropbox/Code/OpenOmics/openomics/database/ontology.py:235: DeprecationWarning: invalid escape sequence \|
    and annotation.str.contains("\||;", regex=True).any()):

tests/test_disease.py::test_import_HMDD
tests/test_disease.py::test_annotate_HMDD
  /Users/stephenmoss/.pyenv/versions/3.8.9/lib/python3.8/encodings/unicode_escape.py:26: DeprecationWarning: invalid escape sequence '\ '
    return codecs.unicode_escape_decode(input, self.errors)[0]

tests/test_disease.py::test_import_HMDD
tests/test_disease.py::test_annotate_HMDD
  /Users/stephenmoss/.pyenv/versions/3.8.9/lib/python3.8/encodings/unicode_escape.py:26: DeprecationWarning: invalid escape sequence '\s'
    return codecs.unicode_escape_decode(input, self.errors)[0]

tests/test_interaction.py::test_import_LncRNA2Target
tests/test_interaction.py::test_get_interactions_lnc2target
  /Users/stephenmoss/Dropbox/Code/OpenOmics/openomics/database/interaction.py:479: FutureWarning: Your version of xlrd is 1.2.0. In xlrd >= 2.0, only the xls format is supported. As a result, the openpyxl engine will be used if it is installed and the engine argument is not specified. Install openpyxl instead.
    table = pd.read_excel(file_resources["lncRNA_target_from_low_throughput_experiments.xlsx"])

-- Docs: https://docs.pytest.org/en/stable/warnings.html
================================================================================================= 38 passed, 8 warnings in 603.78s (0:10:03) =================================================================================================
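Of the remaining warnings, the DeprecationWarning reported for openomics/database/ontology.py:235 comes from writing the regular expression as an ordinary string literal, where `\|` is not a valid Python string escape. A minimal sketch of the usual remedy, a raw string, is shown below using the standard library's `re` module; `has_multiple_values` is a hypothetical stand-in for the check in ontology.py, and the same raw string could presumably be passed to the pandas `str.contains` call.

```python
import re

# In a normal string literal, "\|" is an invalid escape that Python currently
# keeps as backslash + pipe while emitting a DeprecationWarning. The raw
# string spells the identical pattern without involving string escapes.
PATTERN = r"\||;"


def has_multiple_values(annotation):
    """True if the annotation string contains a '|' or ';' separator."""
    return re.search(PATTERN, annotation) is not None
```

The pattern itself is unchanged (a literal pipe or a semicolon), so behaviour should be the same; only the warning goes away.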