althonos / pyhmmer

Cython bindings and Python interface to HMMER3.
https://pyhmmer.readthedocs.io
MIT License

ValueError: Could not determine format of file: '/dbfs/mnt/LuxC.sto' #10

Open lzhangUT opened 3 years ago

lzhangUT commented 3 years ago

Hi, I was following your tutorial on going from a multiple sequence alignment (MSA) to an HMM. I downloaded your example data into my working directory, and I can see the two files (LuxC.faa and LuxC.sto) there: [FileInfo(path='dbfs:/mnt/LuxC.faa', name='LuxC.faa', size=153510), FileInfo(path='dbfs:/mnt/LuxC.sto', name='LuxC.sto', size=150686),

When I try to run this code:

with pyhmmer.easel.MSAFile("/dbfs/mnt/LuxC.sto") as msa_file:
    msa_file.set_digital(alphabet)
    msa = next(msa_file)

It gives me this error: ValueError: Could not determine format of file: '/dbfs/mnt/LuxC.sto'

I am not sure what went wrong; the installation and the first two commands in the tutorial work fine. Thanks for your help.

lzhangUT commented 3 years ago

However, if I manually copy all the content, create the files myself, and save them into my working directory, the files seem to work and the error is gone.

But I have another issue when running the following code:

with pyhmmer.easel.SequenceFile("/dbfs/mnt/alphafold/LuxC.faa") as seq_file:
  seq_file.set_digital(alphabet)
  sequences = list(seq_file)

pipeline = pyhmmer.plan7.Pipeline(alphabet, background=background)
hits = pipeline.search_hmm(query=hmm, sequences=sequences)

ValueError: Could not parse file: Line 2: illegal character -

althonos commented 3 years ago

Hi @lzhangUT ,

In the first snippet, I am not sure what is going wrong, but since it looks like Easel is failing to detect the format, you can always set the file format to "stockholm" manually:

with pyhmmer.easel.MSAFile("/dbfs/mnt/LuxC.sto", format="stockholm") as msa_file:
    msa_file.set_digital(alphabet)
    msa = next(msa_file)
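
Since re-creating the file by hand made the error go away, it may also be worth checking what the downloaded file actually contains. A Stockholm file starts with a "# STOCKHOLM 1.0" header line, so a quick check with the standard library can show whether something else (for example an HTML page) was saved instead. This is only a rough sketch and does not use pyhmmer; the function names are made up for illustration:

```python
STOCKHOLM_HEADER = b"# STOCKHOLM 1.0"


def first_nonblank_line(path):
    """Return the first non-blank line of the file as bytes (b"" if none)."""
    with open(path, "rb") as handle:
        for line in handle:
            if line.strip():
                return line.rstrip(b"\r\n")
    return b""


def looks_like_stockholm(path):
    """Return True if the first non-blank line is a Stockholm header.

    Hypothetical helper: this only inspects the header line and does
    not validate the rest of the alignment.
    """
    return first_nonblank_line(path).startswith(STOCKHOLM_HEADER)
```

Calling first_nonblank_line("/dbfs/mnt/LuxC.sto") and printing the result should make it obvious whether the download contains the alignment or something else.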

In the second one, I suppose it's because you are trying to read a multiple alignment file, and by default using a SequenceFile on those will fail. You need to manually allow the gaps:

with pyhmmer.easel.SequenceFile("/dbfs/mnt/alphafold/LuxC.faa", ignore_gaps=True) as seq_file:
  seq_file.set_digital(alphabet)
  sequences = list(seq_file)
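
To see whether the file really contains gap characters, you can also scan it outside of pyhmmer. The sketch below uses only the standard library and assumes a plain FASTA layout where header lines start with ">"; the function name is hypothetical. It lists the sequence lines that contain "-", which is the character the error message complains about:

```python
def find_gap_lines(path, gap_chars="-"):
    """Return (line_number, line) pairs for sequence lines containing gaps.

    Header lines (starting with '>') are skipped, since a '-' in a
    sequence name or description is not a gap.
    """
    hits = []
    with open(path) as handle:
        for number, line in enumerate(handle, start=1):
            if line.startswith(">"):
                continue
            if any(char in line for char in gap_chars):
                hits.append((number, line.rstrip("\n")))
    return hits
```

If find_gap_lines("/dbfs/mnt/alphafold/LuxC.faa") returns an entry for line 2, the file on disk contains aligned (gapped) sequences rather than raw ones.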

lzhangUT commented 3 years ago

Hi @althonos, thanks for your response. First of all, I think LuxC.faa is a FASTA file, i.e., a sequence file, not a multiple alignment file. Second, I was following the tutorial on your GitHub, and the data is from your GitHub as well. Even after I add 'ignore_gaps=True', the same error is still there.

with pyhmmer.easel.SequenceFile("/dbfs/mnt/alphafold/LuxC.faa", ignore_gaps=True) as seq_file:
  seq_file.set_digital(alphabet)
  sequences = list(seq_file)

ValueError: Could not parse file: Line 2: illegal character -

and the error is for the line in **,