PNNL-CompBio / Snekmer

Pipeline to apply encoded Kmer analysis to protein sequences
BSD 3-Clause "New" or "Revised" License

windows- zipped files not detected/unzipped yet #60

Open abbyjerger opened 2 years ago

abbyjerger commented 2 years ago

In Windows, when running Snekmer with an input of the 4 files from the /resources/tutorial/demo_files/input folder (2 .faa and 2 .faa.gz files), the following message is given in the command line and in the log:

Building DAG of jobs...
MissingInputException in line 46 of C:\Users\jerg881\Miniconda3\envs\snekmer\lib\site-packages\snekmer\rules\kmerize.smk:
Missing input files for rule vectorize:
input\NapB.faa

The same output appears for each of the commands "snekmer model --dryrun", "snekmer model", "snekmer cluster --dryrun", and "snekmer cluster". The directory is left unchanged: the 2 zipped input files remain zipped, and no output directory is generated.

christinehc commented 1 year ago

@abbyjerger Is this still an issue for you?

jjacobson95 commented 5 months ago

Just a status update: on Windows, I am still receiving this error while testing on the 'background' branch.

Command, Error message, Files in directory:

(snekmer) C:\Users\jaco059\Desktop\snekmer_test\learn>snakemake --snakefile ..\Snekmer\snekmer\rules\learn.smk --cores=1 --configfiles=config.yaml
Building DAG of jobs...
MissingInputException in line 57 of C:\Users\jaco059\Desktop\snekmer_test\Snekmer\snekmer\rules\kmerize.smk:
Missing input files for rule vectorize:
input\UP000004358_314230.fasta

(snekmer) C:\Users\jaco059\Desktop\snekmer_test\learn>ls
annotations  config.yaml  input

(snekmer) C:\Users\jaco059\Desktop\snekmer_test\learn>ls input
UP000004358_314230.fasta.gz  UP000056630_1739114.fasta.gz  UP000198893_569882.fasta.gz  UP000199168_556533.fasta.gz  UP000305778_2571141.fasta.gz
UP000031546_45670.fasta.gz   UP000057938_361183.fasta.gz   UP000199134_645273.fasta.gz  UP000239203_155976.fasta.gz  UP000323646_2593411.fasta.gz
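The listing above contains only .fasta.gz files, while the rule asks for the uncompressed input\UP000004358_314230.fasta. A minimal sketch for confirming that kind of mismatch in an input folder is shown here. The helper name `gz_only_inputs` and the file `other.fasta` are made up for illustration and are not part of Snekmer.

```python
import os
import tempfile


def gz_only_inputs(input_dir):
    """Return names like 'x.fasta' that exist only as 'x.fasta.gz'."""
    names = set(os.listdir(input_dir))
    return sorted(
        n[:-3] for n in names if n.endswith(".gz") and n[:-3] not in names
    )


# Throwaway directory with empty placeholder files (names only matter here)
demo = tempfile.mkdtemp()
for name in ("UP000004358_314230.fasta.gz", "other.fasta", "other.fasta.gz"):
    open(os.path.join(demo, name), "w").close()

missing = gz_only_inputs(demo)
print(missing)
```

Any name this returns is a file that a rule expecting uncompressed input would report as missing unless an unzip step runs first.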

As I have a Windows PC available, I'll look into this further. @christinehc how would you like me to commit changes once this is working? Should I do this directly on the background branch or elsewhere?

jjacobson95 commented 5 months ago

@christinehc @biodataganache I can see why this issue has been so elusive - there is a lot of code built around this that is difficult to reconcile across multiple smk scripts. I was wondering if there might be advantages to an alternative approach?

Instead of unzipping these files as their own step, we could enable the script to read gzipped files directly.

In kmerize.smk, we would simply replace fasta = SeqIO.parse(input.fasta, "fasta") with the following:

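        # requires `import gzip` at the top of kmerize.smk
        if input.fasta.endswith('.gz'):
            # text mode ('rt') so SeqIO receives str lines, not bytes
            fasta_handle = gzip.open(input.fasta, 'rt')
        else:
            fasta_handle = open(input.fasta, 'r')
        fasta = SeqIO.parse(fasta_handle, "fasta")

        ...
        fasta_handle.close()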
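As a rough, standalone illustration of the same idea, the sketch below wraps the if/else in a context manager so the handle is closed even if parsing fails. It uses only the standard library: `open_fasta` and `read_ids` are hypothetical helpers, and the tiny ID reader stands in for `SeqIO.parse(handle, "fasta")`, which accepts the same kind of text handle.

```python
import gzip
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def open_fasta(path):
    """Yield a text-mode handle for a plain or gzipped FASTA file."""
    path = str(path)
    # gzip.open(..., "rt") decompresses on the fly and decodes to text
    opener = gzip.open if path.endswith(".gz") else open
    handle = opener(path, "rt")
    try:
        yield handle
    finally:
        handle.close()  # runs even if the caller raises while parsing


def read_ids(path):
    """Collect record IDs: the text after '>' up to the first whitespace."""
    with open_fasta(path) as handle:
        return [line[1:].split()[0] for line in handle if line.startswith(">")]


# Demo with two throwaway files holding the same made-up record
demo_dir = tempfile.mkdtemp()
plain_path = os.path.join(demo_dir, "demo.faa")
gz_path = os.path.join(demo_dir, "demo.faa.gz")
record = ">seq1 example protein\nMKTAYIAKQR\n"
with open(plain_path, "w") as f:
    f.write(record)
with gzip.open(gz_path, "wt") as f:
    f.write(record)

plain_ids = read_ids(plain_path)
gz_ids = read_ids(gz_path)
print(plain_ids, gz_ids)
```

Both calls should return the same IDs, since the compressed and uncompressed files hold identical text.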

In learn.smk and the other rule files, we would remove all of the UZ variables and scripts relating to unzipping, and replace FA_MAP with the following code:

input_files = glob(join(input_dir, "*"))

FA_MAP = {
    f.split('.')[0]: '.'.join(f.split('.')[1:]) for f in (os.path.basename(x) for x in input_files)
}

This would create a dictionary with something like this as an output:
{'UP000004358_314230': 'fasta', 'UP000031546_45670': 'fasta.gz',...}
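To make the behavior of that comprehension concrete, here is a small sketch that runs it on a few example paths. One caveat it surfaces: splitting on the first dot truncates any file stem that itself contains a dot. `suffix_map` is a hypothetical alternative, and both the `SUFFIXES` tuple and the file `sample.v2.faa.gz` are assumptions for illustration, not anything taken from Snekmer.

```python
import os

# Assumed set of accepted extensions; longer suffixes are listed first
SUFFIXES = (".fasta.gz", ".faa.gz", ".fasta", ".faa")


def proposed_map(paths):
    """The comprehension from the comment above, unchanged in behavior."""
    names = [os.path.basename(p) for p in paths]
    return {f.split(".")[0]: ".".join(f.split(".")[1:]) for f in names}


def suffix_map(paths, suffixes=SUFFIXES):
    """Hypothetical variant: strip a known suffix instead of splitting on the first dot."""
    out = {}
    for p in paths:
        name = os.path.basename(p)
        for suf in suffixes:
            if name.endswith(suf):
                out[name[: -len(suf)]] = suf[1:]  # drop the leading dot
                break
    return out


paths = [
    "input/UP000004358_314230.fasta",
    "input/UP000031546_45670.fasta.gz",
    "input/sample.v2.faa.gz",  # made-up name with a dot in the stem
]
first_dot = proposed_map(paths)
by_suffix = suffix_map(paths)
print(first_dot)
print(by_suffix)
```

For the UniProt-style names in this thread the two approaches agree. They differ only for the dotted stem, where the first-dot split yields the key `sample` and the suffix-based version keeps `sample.v2`.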

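Advantages:

  • Simpler
  • Less code to maintain long term
  • There should be no differences between Mac, Windows and Linux
  • No duplicate files created in a zipped directory within input. (less overall storage space used + files aren't unzipped)

Disadvantages:

  • Removal and updating current code.
  • Maybe speed changes?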

biodataganache commented 5 months ago

I suggest we (for now) drop gzipped file support. There’s no reason to have it other than convenience. We can continue to work on this issue on a development branch?

Jason McDermott, Ph.D. (he/him)
Senior Research Scientist
Pacific Northwest National Laboratory, MSIN: J4-18
902 Battelle Boulevard
PO Box 999
Richland, Washington 99352
Phone: 509-372-4360
Fax: 509-371-6946


christinehc commented 5 months ago


FYI, the gzip-handling changes have been implemented in model, search, and cluster modes, but not in learn, apply, or motif modes, which have not been pulled into this branch yet. I changed the underlying code to use glob_wildcards rather than glob to pull files. I would therefore not expect unzipping to work with learn.smk, which is where the error is coming from.

christinehc commented 5 months ago


> Advantages:
>
>   • Simpler
>   • Less code to maintain long term
>   • There should be no differences between Mac, Windows and Linux
>   • No duplicate files created in a zipped directory within input. (less overall storage space used + files aren't unzipped)
>
> Disadvantages:
>
>   • Removal and updating current code.
>   • Maybe speed changes?

The initial reason I didn't use a similar if/else to handle file unzipping is that these can complicate Snakemake's understanding of how to handle files, hence a higher-level rule to optionally handle gzipped files. I would try testing the API changes for file getting (see syntax in model.smk lines 38-69: https://github.com/PNNL-CompBio/Snekmer/blob/background/snekmer/rules/model.smk#L38) and then unzipping (model.smk lines 142-152: https://github.com/PNNL-CompBio/Snekmer/blob/background/snekmer/rules/model.smk#L142) on snekmer learn/apply and see if those changes work. Before doing that, I would test snekmer model/cluster/search on Windows to answer the original question of whether the new unzipping code works on Windows systems.
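When testing the unzip step on Windows, a decompression routine that avoids shell tools can be a useful point of comparison. The sketch below is one such routine using only the Python standard library. It is not taken from model.smk, whose unzip rule is not shown in this thread, and `gunzip_copy` is a hypothetical helper name.

```python
import gzip
import os
import shutil
import tempfile


def gunzip_copy(src, dst):
    """Decompress src (.gz) to dst without shell tools; src is left in place."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # streams in chunks, not all in memory
    return dst


# Demo on a throwaway gzipped file with a made-up record
demo_dir = tempfile.mkdtemp()
src = os.path.join(demo_dir, "demo.faa.gz")
dst = os.path.join(demo_dir, "demo.faa")
payload = b">seq1\nMKTAYIAKQR\n"
with gzip.open(src, "wb") as f:
    f.write(payload)

gunzip_copy(src, dst)
with open(dst, "rb") as f:
    restored = f.read()
print(restored == payload)
```

Because this relies only on `gzip` and `shutil`, it should behave the same on Windows, macOS, and Linux, which makes it a convenient baseline when checking whether a rule-level unzip works across platforms.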