MrOlm / inStrain

Bioinformatics program inStrain
MIT License

Enquiry about "codecs.py" and strain relative abundance #159

Open jiangys30 opened 9 months ago

jiangys30 commented 9 months ago

Hi,

I want to express my gratitude to inStrain. I have read some articles on strain-level analysis tools, and I believe that inStrain provides one of the most comprehensive and accurate predictions in this field.

I am currently using inStrain to process paired-end .fq data, and I have encountered the following issue when the program reaches "inStrain profile Step 2. Profile scaffolds":

File "/lustre1/g/aos_shihuang/tools/anaconda3/envs/inStrain/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 3: invalid start byte

I have not been able to find anyone else with the same issue on Google. One possible solution I have considered is upgrading Python to version 3.9, but I am uncertain if this is the correct action.

Additionally, a friend informed me that the results obtained from inStrain do not include statistics on strain relative abundance. I noticed that the log.log file mentions "coverm," which is capable of calculating DNA read coverage and relative abundance. If I were to install coverm within the inStrain environment, would I obtain information about strain relative abundance in the output?

Thank you for your time and assistance.

Cheers,

Jason

MrOlm commented 9 months ago

Hi Jason,

Thanks for the kind words. A few things:

1) That error indicates that one of the files you're providing to Python can't be read with the standard encoding (UTF-8). Maybe you're accidentally including compressed files, which aren't allowed? Maybe you have special non-English characters in one of your files? I can try to help troubleshoot further if you post the full command and error code.

2) Yes, that's somewhat correct. While inStrain will tell you which samples have the same strain, the only relative abundance information provided is the relative abundance of the genome that the strain comes from. You can calculate it with coverm, but that's not necessary. By default, inStrain will tell you the relative abundance of each genome in the input.

Best, Matt

jiangys30 commented 9 months ago

Hi Matt,

Thanks for your help! Here are the full command and the error code

Just in case, there is also a sample list, which is used for commands to read the sequence reads.

Cheers,

Jason

MrOlm commented 9 months ago

Hi Jason,

According to that error there seems to be a non-standard character in reference_genes.fna. Maybe open that file up and see if you notice anything strange about it? It should look like a standard, uncompressed .fasta file.

Best, Matt
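
For anyone hitting the same UnicodeDecodeError, a quick way to act on the suggestion above is to scan the suspect file for bytes that do not decode as UTF-8, and to check whether it is actually gzip-compressed. The following is only a minimal sketch, not part of inStrain: the function names `looks_gzipped` and `find_undecodable_lines` are hypothetical helpers made up for illustration, and the only facts assumed are that gzip files begin with the bytes 0x1f 0x8b and that Python raises UnicodeDecodeError with a `start` offset on invalid input.

```python
GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of a gzip file


def looks_gzipped(path):
    """Return True if the file starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def find_undecodable_lines(path, encoding="utf-8", limit=10):
    """Return (line_number, offset_in_line, byte_value) for lines that fail to decode.

    Hypothetical helper for illustration; it stops after `limit` problem lines
    and reports only the first bad byte on each of them.
    """
    problems = []
    # Read in binary mode so that nothing is decoded implicitly.
    with open(path, "rb") as handle:
        for line_number, raw in enumerate(handle, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as err:
                # err.start is the offset of the first offending byte in this line.
                problems.append((line_number, err.start, raw[err.start]))
                if len(problems) >= limit:
                    break
    return problems
```

Calling `looks_gzipped("reference_genes.fna")` and then `find_undecodable_lines("reference_genes.fna")` (using the file name mentioned in this thread) would show whether the file is compressed and, if not, which lines contain the unexpected bytes, so they can be inspected or removed in a text editor.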