cerebis / sim3C

Read-pair simulation of 3C-based sequencing methodologies (HiC, Meta3C, DNase-HiC)
GNU General Public License v3.0

References without cut-sites should still produce spurious read-pairs and not just be excluded from simulation. #18

Open Mustafa-Albekaa opened 4 years ago

Mustafa-Albekaa commented 4 years ago

I've been having some trouble simulating HiC reads, and after an hour of troubleshooting I think I've identified the issue.

This is the command I've been running, and the error I've been running into.

sim3C --dist uniform -n 10000 -l 150 -e Sau3AI -m hic --profile-name ${genome}_simhic_profile.tsv $genome.fasta ${genome}_simhic.fastq

ERROR    | 2020-08-25 02:09:05,237 |    main | 'Seq' object has no attribute 'id'
Traceback (most recent call last):
  File "/home/mustafa/.local/lib/python2.7/site-packages/sim3C/command_line.py", line 213, in main
    args.num_pairs, args.method, args.read_length, **kw_args)
  File "/home/mustafa/.local/lib/python2.7/site-packages/sim3C/simulator.py", line 307, in __init__
    create_cids=create_cids, linear=linear)
  File "/home/mustafa/.local/lib/python2.7/site-packages/sim3C/community.py", line 507, in __init__
    random_state, create_cids, linear))
  File "/home/mustafa/.local/lib/python2.7/site-packages/sim3C/community.py", line 82, in __init__
    self.sites = CutSites(enzyme, seq.seq, self.random_state, linear=linear)
  File "/home/mustafa/.local/lib/python2.7/site-packages/sim3C/site_analysis.py", line 63, in __init__
    raise NoCutSitesException(template_seq.id, str(enzyme))
AttributeError: 'Seq' object has no attribute 'id'

I believe the problem is that template_seq does not have an id attribute. Using type() on template_seq identifies it as a Bio.Seq.Seq object.

I've removed the sequences that were causing the issue and am now able to run the program, but this bug meant I was not able to easily identify which sequences did not have cut sites.
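For anyone else who needs to find the offending sequences before running the simulation, here is a minimal sketch that does not depend on sim3C or Biopython. It assumes the enzyme is Sau3AI (recognition site GATC, which is its own reverse complement, so scanning one strand is enough) and that replicons are treated as circular unless `linear=True`; the function names `read_fasta` and `ids_without_site` are made up for this example and are not part of sim3C.

```python
def read_fasta(lines):
    """Yield (seq_id, sequence) pairs from an iterable of FASTA lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # The id is the first whitespace-delimited token of the header.
            header = line[1:].split()
            seq_id, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def ids_without_site(records, site="GATC", linear=False):
    """Return the ids of sequences that do not contain the recognition site.

    Only valid as written for palindromic sites such as GATC, where the
    reverse strand carries the same motif at the same position.
    """
    missing = []
    for seq_id, seq in records:
        seq = seq.upper()
        if not linear:
            # On a circular replicon a site may span the origin, so wrap
            # the first len(site) - 1 bases onto the end before searching.
            seq += seq[:len(site) - 1]
        if site not in seq:
            missing.append(seq_id)
    return missing


example = """>chr1 example chromosome
AAGATCTT
>plasmid_a
ACGTACGT
>plasmid_b
TCAAAAGA
""".splitlines()

circular_missing = ids_without_site(read_fasta(example))
linear_missing = ids_without_site(read_fasta(example), linear=True)
print(circular_missing)
print(linear_missing)
```

Here `plasmid_b` only contains GATC across the origin, so it is reported as missing a site only when treated as linear. For a real file, pass an open file handle to `read_fasta` in place of the example lines.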

cerebis commented 4 years ago

Hi Mustafa,

Without looking, this sounds suspiciously like a change in the Biopython API. There have been two attributes which contain the same value, Bio.Seq.name and Bio.Seq.id. It might be that .id has finally been dropped. Just a guess for now. There should be a fix with some version pinning to avoid this, if my suspicion is correct.

On that note, could you provide the output of pip freeze?

cerebis commented 4 years ago

Well, ignore what I just said; it seems this is entirely a bug in sim3C.
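Going by the traceback, the cut-site finder is handed the bare sequence (`seq.seq`) and then tries to read `.id` from it when building the exception, which only the enclosing record has. Below is a rough sketch of one way to avoid that pattern, using stand-in classes rather than sim3C's or Biopython's own; `Record`, `NoCutSitesError` and `find_sites` are hypothetical names for illustration, and this is not necessarily how the actual fix is written.

```python
class NoCutSitesError(Exception):
    """Hypothetical stand-in for an exception like NoCutSitesException."""

    def __init__(self, seq_id, enzyme):
        super().__init__(
            "sequence {} has no cut-sites for {}".format(seq_id, enzyme))
        self.seq_id = seq_id
        self.enzyme = enzyme


class Record:
    """Stand-in for a record that carries an id alongside its sequence."""

    def __init__(self, seq_id, seq):
        self.id = seq_id
        self.seq = seq  # a plain str here; like a bare sequence, it has no .id


def find_sites(seq_id, seq, site, enzyme):
    """Return start positions of `site` in `seq`.

    The id is passed in explicitly, so the error can name the sequence
    without assuming the bare sequence object knows its own id.
    """
    positions = [i for i in range(len(seq) - len(site) + 1)
                 if seq.startswith(site, i)]
    if not positions:
        raise NoCutSitesError(seq_id, enzyme)
    return positions


records = [Record("chr1", "AAGATCTT"), Record("plasmid_a", "ACGTACGT")]
skipped = []
for rec in records:
    try:
        find_sites(rec.id, rec.seq, "GATC", "Sau3AI")
    except NoCutSitesError as err:
        # The caller can now report exactly which sequence was skipped.
        skipped.append(err.seq_id)
print(skipped)
```

With the id carried on the exception, the caller can log or collect the names of sequences lacking a site instead of failing with an unrelated AttributeError.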

Mustafa-Albekaa commented 4 years ago

Hello Matthew,

I hope this will be fixed soon! Sim3C is very useful and has been quite easy to use.

Output for pip freeze, in case you still need it, is:

biopython==1.76
BUSCO==3.1.0
certifi==2019.11.28
enum34==1.1.10
funcsigs==1.0.2
iced==0.4.2
intervaltree==3.0.2
llvmlite==0.31.0
numba==0.47.0
numpy==1.16.6
PyYAML==5.3.1
scipy==1.2.3
sim3C @ git+https://github.com/cerebis/sim3C@43e2ccfabf55f9ddb84754e9b29b8791d4bd34c0
singledispatch==3.4.0.3
six==1.15.0
sortedcontainers==2.2.2
tqdm==4.45.0

cerebis commented 4 years ago

I have committed a fix to handle this issue (9830b3c0b0a4f50e90922c3cbf061dbb076d72a6).

Unfortunately, this may not be the logic you were hoping to see. Reference sequences which do not contain a cut-site will be ignored in the simulation, and if a cell contains only that replicon, it too will be ignored.

Regarding how sim3C simulates Hi-C reads, a sequence which contains no cut-sites will not produce read-pairs with proximity ligations. It would, however, still be capable of producing spurious read-pairs (noise). I will leave this issue open, but modify the title to reflect that this should be addressed in future.
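To make the intended behaviour concrete, here is a toy sketch of the selection logic described above. It is not sim3C code: the function name `choose_source`, the data layout, and the weighting (cut-site count for ligation pairs, sequence length for noise pairs) are all assumptions made for illustration.

```python
import random


def choose_source(replicons, spurious, rng):
    """Pick the replicon that one simulated read-pair originates from.

    replicons: dict mapping name -> (length, number_of_cut_sites)
    spurious:  True for a noise pair, False for a proximity-ligation pair
    rng:       a random.Random instance
    """
    if spurious:
        # Noise pairs do not depend on digestion, so every replicon is
        # eligible; weight by length (an assumption of this sketch).
        eligible = {name: length
                    for name, (length, sites) in replicons.items()}
    else:
        # Ligation pairs need at least one cut-site; weight by site count
        # (also an assumption of this sketch).
        eligible = {name: sites
                    for name, (length, sites) in replicons.items()
                    if sites > 0}
    if not eligible:
        raise ValueError("no replicon can produce this kind of read-pair")
    names = sorted(eligible)
    weights = [eligible[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(0)
replicons = {"chr1": (5000, 12), "plasmid_a": (800, 0)}
ligation_sources = {choose_source(replicons, False, rng) for _ in range(200)}
noise_sources = {choose_source(replicons, True, rng) for _ in range(500)}
```

In this toy setup `plasmid_a` has no cut-sites, so it never appears among the ligation sources, but it remains eligible as a source of noise pairs rather than being dropped from the simulation entirely.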