dovetail-genomics / Micro-C

Micro-C QC and data analysis

Question on filtering of cis-reads with insert sizes less than 1kb #3

Open NMaziak opened 1 year ago

NMaziak commented 1 year ago

Hello there, I'm new to using pairtools and have some Micro-C's to analyse and found your walkthrough (thanks, it's been very helpful!)

I had a question, and it might be that I keep missing it in the pairtools documentation, but I was wondering where in the pipeline the filtering of cis-interactions with insert size less than 1 kb is happening (as it's mentioned in the QC plot linked below)? https://micro-c.readthedocs.io/en/latest/library_qc.html

Any clarification is much appreciated, thanks again! Noura

cpadillla1331 commented 1 year ago

Hi Noura,

Thank you for your question. We do not filter out reads less than 1kbp in insert size. There are two reasons that we do not do this.

  1. While these reads are not topologically informative for TAD and loop calling, these molecules contain information about how nucleosomes interact. For example, let's look at four nucleosomes in linear space. The footprint of that complex would be 4 x 150 bp = 600 bp, plus linker DNA at ~50 bp each, 5 x 50 bp = 250 bp, so together this molecule occupies ~850 bp in length. If we then apply the Micro-C assay and N1 forms a crosslink with N4 because they are close to each other, that link would be an actual proximity ligation event with an insert size of ~600 bp (assuming the midpoint of each nucleosome is being cross-linked). If we removed inserts <1 kb, we would never know that N1 and N4 are in contact with each other. Retaining these reads in the library is advantageous for investigating topology down to the nucleosome scale.
  2. These reads are not considered during larger topology feature calling. Algorithms that call TADs and Loops do not consider any interaction <20kb. Smaller insert read pairs don't impact the results of these tools. Moreover, filtering out short inserts would take a long time computationally, so if it doesn't impact the results, it's not worth removing them.
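The arithmetic in point 1 can be checked with a few lines of Python. This is only a sketch: the 150 bp nucleosome footprint and 50 bp linker are the approximate figures from the comment, and the function names are hypothetical, chosen for illustration.

```python
# Approximate sizes taken from the comment above; real values vary by organism.
NUCLEOSOME_BP = 150
LINKER_BP = 50


def array_footprint(n_nucleosomes, nucleosome_bp=NUCLEOSOME_BP, linker_bp=LINKER_BP):
    """Length of n nucleosomes plus n + 1 linkers (flanking linkers included)."""
    return n_nucleosomes * nucleosome_bp + (n_nucleosomes + 1) * linker_bp


def midpoint_separation(i, j, nucleosome_bp=NUCLEOSOME_BP, linker_bp=LINKER_BP):
    """Distance between the midpoints of nucleosome i and nucleosome j."""
    return abs(j - i) * (nucleosome_bp + linker_bp)


footprint = array_footprint(4)        # 4 * 150 + 5 * 50
insert_size = midpoint_separation(1, 4)  # 3 * (150 + 50)
print(footprint, insert_size)
```

Under these assumptions the N1-N4 contact falls below a 1 kb cutoff, which is why a hard filter at 1 kb would discard it.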

You've touched on one of the biggest challenges in managing the change from a restriction enzyme to sequence-independent fragmentation. Much of the classical nomenclature around restriction enzyme-based Hi-C read classification doesn't apply to MNase-based Hi-C. Still, we have a large customer base that is used to and expects this terminology, such as "valid" or "invalid" Hi-C reads. As such, we tell people how many topologically informative reads there are by using a standard threshold of 1kb.
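To illustrate the difference between reporting against the 1 kb threshold and filtering on it, here is a minimal sketch that tallies pairs without dropping any. It assumes tab-separated lines in the standard .pairs column order (readID, chrom1, pos1, chrom2, pos2, ...) with `#` header lines; the function name and category labels are hypothetical, and this is not the code behind the QC plot.

```python
from collections import Counter


def classify_pairs(lines, min_cis_distance=1000):
    """Count trans, long-range cis and short-range cis pairs; nothing is removed."""
    counts = Counter()
    for line in lines:
        # Skip blank lines and .pairs header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom1, pos1 = fields[1], int(fields[2])
        chrom2, pos2 = fields[3], int(fields[4])
        if chrom1 != chrom2:
            counts["trans"] += 1
        elif abs(pos2 - pos1) >= min_cis_distance:
            counts["cis_long"] += 1
        else:
            counts["cis_short"] += 1
    return counts


# Made-up example records, for illustration only.
example = [
    "## pairs format v1.0",
    "read1\tchr1\t1000\tchr1\t1600\t+\t-",
    "read2\tchr1\t1000\tchr1\t50000\t+\t-",
    "read3\tchr1\t1000\tchr2\t2000\t+\t+",
]
counts = classify_pairs(example)
print(dict(counts))
```

The short-range cis pairs are counted in their own category but stay in the data, so nucleosome-scale contacts remain available downstream.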

Hopefully that is helpful, Cory

NMaziak commented 11 months ago

Hi Cory,

Very sorry for the delay, but thank you for such an informative answer! I'm testing out some of the new features in pairtools and consequently am revisiting this. Just to make sure then, the only filtration you are doing is deduplicating and removing pairs which are separated by less than 30 bp, is that correct?

All the best, Noura

cpadillla1331 commented 11 months ago

Hi Noura - Yup - you got it!
