cerebis / sim3C

Read-pair simulation of 3C-based sequencing methodologies (HiC, Meta3C, DNase-HiC)
GNU General Public License v3.0

Fasta sequences which contain IUPAC characters other than ACGT throw an exception in Art.py #5

Closed Tocci89 closed 6 years ago

Tocci89 commented 6 years ago

Hi, I want to simulate Hi-C reads from my fasta reference. I set all the options and ran the command below:

python /my/path/to/sim3C/sim3C.py -C gzip -m hic -e MboI -l 125 -r 23 -n 232709663 --dist uniform --machine-profile HiSeq2500L125 MYREF.fasta OUTPUT_sim_reads.fastq.gz

Here's what I get:

Warning: no reference supplied, calls will have to supply a template
Starting sequencing simulation
Library method: hic
Progress: 0%| | 62/232709663 [00:00<211:14:51, 306.00it/s]
Error: 'n'

I'm running sim3C.py with python 2.7; the fasta file has a .fai index.

Thanks in advance for your help

cerebis commented 6 years ago

Though that is definitely not an intended error message, did you happen to leave off the --fasta flag when passing the reference sequence?

E.g., before MYREF.fasta:

python /my/path/to/sim3C/sim3C.py -C gzip -m hic -e MboI -l 125 -r 23 -n 232709663 --dist uniform --machine-profile HiSeq2500L125 --fasta MYREF.fasta OUTPUT_sim_reads.fastq.gz

Please let me know how you make out.

Tocci89 commented 6 years ago

If I add --fasta I get: sim3C.py: error: unrecognized arguments: --fasta

Anyway, what bothers me is the "Error: 'n'" message: I don't understand what it refers to.
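For readers puzzled by the same message: a bare quoted character such as 'n' is what Python produces when a KeyError is converted to a string. The sketch below reproduces that shape with a strict dictionary lookup. The table COMPLEMENT and the functions complement and describe_failure are hypothetical names made up for illustration; this is not the code in Art.py, and it only assumes that the exception is printed behind an "Error: " prefix.

```python
# Hypothetical lookup table for illustration only; NOT the table used in Art.py.
COMPLEMENT = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A',
              'a': 't', 'c': 'g', 'g': 'c', 't': 'a'}


def complement(seq):
    """Complement a sequence by strict dictionary lookup (raises KeyError on unknown symbols)."""
    return ''.join(COMPLEMENT[base] for base in seq)


def describe_failure(seq):
    """Return the message a generic 'Error: ...' handler would print, or None on success."""
    try:
        complement(seq)
    except KeyError as exc:
        # str(KeyError('n')) is the repr of the missing key, i.e. 'n' with quotes.
        return 'Error: {}'.format(exc)
    return None


print(describe_failure('acgtnacgt'))  # prints: Error: 'n'
```

Under that assumption, the message names the first symbol that was missing from a lookup table, not a flag or an option.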

cerebis commented 6 years ago

Sorry, my mistake: the UX change I was referring to is within an experimental branch (not intended for use at the moment). Not that I expected it to fix that error. Never answer bug reports late in the evening.

Testing this myself with your exact command line, I do not get an error, but I don't have the benefit of your reference data. Can you send me the file, or perhaps post it to Zenodo?

cerebis commented 6 years ago

I have just pushed a small update to the master branch that slightly improves error handling when reading the reference sequence file. The commit also includes an option to print a trace of the exception.

Could you try running Sim3C again with the --debug option added?

You might get a better error message now, but the trace would help me.

I suspect the error you are experiencing originates within Bio.SeqIO.

Tocci89 commented 6 years ago

Thank you for the help. I think I've solved the mystery... I tried running with other fasta and multifasta files and the script works. So the problem was in my input fasta, and the only difference I found was that the letters in my fasta were all lowercase. After converting them to uppercase, everything worked well. I was confident that the translator in Art.py could handle both cases, but maybe lowercase "n" characters are causing trouble. Anyway, it is working fine now! Thanks again!!!

cerebis commented 6 years ago

Ok, thank you for the report. That is a surprising defect!

cerebis commented 6 years ago

This error is actually due to non-ACGT characters and is not case related. Thanks for bringing it to my attention.
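Given that diagnosis, it can help to check a reference for symbols outside ACGT before starting a long simulation. The following is a minimal sketch using only the Python standard library; the function name non_acgt_report and the demo records are hypothetical, and it assumes a plain-text FASTA where header lines start with ">". It is not part of sim3C.

```python
import io
from collections import Counter

ACGT = set('ACGT')


def non_acgt_report(handle):
    """Map each record ID to a Counter of symbols outside ACGT (case-insensitive).

    Records containing only A, C, G and T (in either case) are left out.
    """
    report = {}
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # Use the first whitespace-delimited token of the header as the ID.
            fields = line[1:].split()
            current = fields[0] if fields else ''
            report[current] = Counter()
        elif current is not None:
            report[current].update(ch for ch in line if ch.upper() not in ACGT)
    return {name: counts for name, counts in report.items() if counts}


# Small in-memory demo; the records are made up for illustration.
demo = io.StringIO(u">chr1 demo\nacgtnnACGT\n>chr2\nACGTRY\n>chr3\nACGT\n")
report = non_acgt_report(demo)
for name in sorted(report):
    print(name, dict(report[name]))
```

To check a real file, pass an open file handle instead of the in-memory demo, for example with open('MYREF.fasta') as fh: non_acgt_report(fh). What to do with the flagged records (mask, drop, or replace the ambiguous symbols) depends on the analysis and is not something this thread prescribes.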