bioinfo-biols / CIRI-long

Circular RNA Identification for Nanopore Sequencing
https://ciri-cookbook.readthedocs.io
MIT License

CCS copy number in each read #13

Open mmaitenat opened 2 years ago

mmaitenat commented 2 years ago

Hi again,

I would like to know if the number of times a circRNA is repeated in each read (which I think you call CCS copy number) is reported somewhere in the output of CIRI-long. My idea is to make plots similar to those in Supplementary Figure 7 of your article "Comprehensive profiling of circular RNAs with nanopore sequencing and CIRI-long" with my own data.

Thanks!

Maitena.

Kevinzjy commented 2 years ago

Hi, you can find the information you need in the 7th column of the header lines in *.cand_circ.fa, which is generated by the CIRI-long call command. That column contains the start and end positions of the CCS segments in the raw ONT reads.

>9b2ec396-b290-4b90-b115-ff3fbc33076d   chr1:87938154-87940306  +       87938154-87938265|114,87938346-87938438|94,87939196-87939315|121,87940177-87940306|130      AG-GT|2-1       288|0-463       10-452;452-905;905-1358;1358-1467
GCATTCAGGGAGATAGCACAGTCCCACAGAGCCATGGAACAGGAGCTCGCACATGCTGTCAATGCCAGCTCCAAAGCCATGGAGCAGTATACAGCAAGCCCAGAACTGCAGAGGGTTGAACTGCCAGCTTTGTTCTGGAGATGGTGAATAACATCAGAGCACTGCGCAGTGAGACAGAGCTGCTGCTGGCTGGGAAGATGGCCCTGCAATTGGATCCCCCTCAGAAGGAACGGCAGAAACCGGGGCTGCCCTAATTGAGATGGACCAGCAGCTCAGGAAGCTGACAGACACTCCCTGGCTTTACGCCAGCCCTTGGAAGCCTGGTGAGGAAGAGTCTCTCCAACAGAATGTGATGCTGGATCTTACTAAACGCAGCCGTAGTGGTAAATTCCGCCTTGTGACCAAGTTTAAAAAGGAGAAAAACAATAAGAACAAAGAAGTTCACAGTAACCTAGGAGGCCCT
>82ca865f-cd01-4c4b-a12b-5343d9f8464b   chr4:95850782-95851509  +       95850782-95851509|731   AG-GT*|-3--6    611|3-725       41-758;758-845
GGTAGTCCTCTAGAGCTGATGAGGTTTGTAGAGTCAGACCCCAGCTACAGCTGTAGAACCAGGCATCCTTGGTTGCTGGAAACCAATCCTGGAAGCAGAGTACTAGCGCATGCCCAAACTCATGAAACAGCCAGTATAGAGCTGGAAGAAAGTCAGACCCCCAGCTACCAGCTGAGAACCAGGCACTTCAACCACTTGCCCGCATGCCCCAGTGTTAGAAGTGACAAACCAGGTGTTCTAATAATTTTTAATAATTGGGAATTCAATTTGCTGTGACTGCCTGAGTGTGGCAGACCCTGTGCTAAGTTCTTTAGTATAGCTCTCCTAATGCATATAATACCCTTTCATGGCCTGTAAGAGGGCCAGAAACTTACAAACACAGACCATTAGAAACCTCCAGTGGCAGAAGCCCATTTCCAGTTTAAGAATGGAGCTGGGCATGTGGCTTGGTGCTTAAAGCACTTCTGTCTTCCAGAGGACCTGCATCAATTTCCAGTACATTGTTGGTTCATCTGTGGAGTTATCATCTGTAACTCCGGTACCAGGAGTCTACTGCCCTCTCCTTCTGGAATTACCCTGGTGGTGGTGCCTATGCATAAACCTATCATTCAATCTATACAAAACAAACTAATCAATTACTCAATACGAAATAATATGTGCAACTAATTGTCATTGGATGGGCTGACTGTAGTGATGAATTGTCTCATAAAAGGTCAGTCTGGGCA
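
For anyone who wants to turn that column into a per-read number, here is a minimal sketch in Python. It is not part of CIRI-long: the function names are hypothetical, it assumes the header fields are whitespace-separated with the segments in the 7th field as in the example above, and it treats "copy number" either as the plain count of segments or as the total segment length divided by the longest segment. Both definitions are assumptions and may differ from the one used for the figure in the paper.

```python
def parse_segments(field):
    """Parse a field like '10-452;452-905' into [(10, 452), (452, 905)]."""
    segments = []
    for part in field.strip().split(";"):
        if part:
            start, end = part.split("-")
            segments.append((int(start), int(end)))
    return segments


def ccs_copy_number(header):
    """Return (read_id, n_segments, fractional_copies) for one FASTA header.

    Assumption: fields are whitespace-separated and the 7th field holds
    the CCS segment coordinates.
    """
    fields = header.lstrip(">").split()
    read_id = fields[0]
    segments = parse_segments(fields[6])
    if not segments:
        return read_id, 0, 0.0
    lengths = [end - start for start, end in segments]
    # Fractional estimate: total covered length relative to the longest
    # segment, which is taken here as one full copy (an assumption).
    fractional = sum(lengths) / max(lengths)
    return read_id, len(segments), fractional


def copy_numbers_from_fasta(path):
    """Yield one (read_id, n_segments, fractional_copies) per header line."""
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                yield ccs_copy_number(line)
```

Calling `copy_numbers_from_fasta` on a *.cand_circ.fa file gives one tuple per read, which can then be passed to any plotting library.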

The *.reads output of the CIRI-long collapse command also includes the correspondence between the read id (1st column) and the collapsed circRNA id (2nd column).

read_id circ_id tmp_id  strand  cirexons        signal  alignment       segments        sample  type
d6637a72-5a5b-41e6-8341-25ed39330ed2    chr1:3421702-3526342    chr1:3421702-3526342    -       3421702-3421901|201,3516918-3517016|100,3517613-3517717|106,3523427-3523692|267,3526200-3526342|137 AG-GT|1-7       263|9-831       41-858;858-1254 Long_SMARTer_H-_repfull
0ab11e74-33e9-4272-aae0-2a22035e7bc1    chr1:3421702-3526342    chr1:3421702-3526342    -       3421702-3421901|199,3516918-3517016|100,3517613-3517717|106,3523427-3523692|267,3526200-3526342|146 AG-GT|-1--2     151|0-831       10-827;827-1232 Long_SMARTer_H-_repfull
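
To group the segment counts by collapsed circRNA, something like the following sketch could be used. It assumes the *.reads file is tab-separated with the header line shown above; the function name is hypothetical, and counting segments is only a simple proxy for copy number.

```python
import csv
from collections import defaultdict


def copies_per_circrna(lines):
    """Map circ_id -> list of (read_id, number of CCS segments).

    `lines` is any iterable of text lines (for example an open file)
    whose first line is the tab-separated header of the *.reads file.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    per_circ = defaultdict(list)
    for row in reader:
        # The segments column looks like '41-858;858-1254'.
        segments = [s for s in (row.get("segments") or "").split(";") if s]
        per_circ[row["circ_id"]].append((row["read_id"], len(segments)))
    return dict(per_circ)
```

With a real file this would be called as `copies_per_circrna(open("sample.reads"))`, where `sample.reads` is a placeholder file name.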

mmaitenat commented 2 years ago

Hi! That's clear, thanks! I am so sorry for asking so many questions, but I'm afraid I have a few more... When I was going through this, I found in the .info files circRNAs with negative length values. Let me show you some examples:

grep 'circ_len "-' barcode03.info | head -5

2 CIRI-long circRNA 76696524 76696522 3 - . circ_id "2:76696524-76696522"; splice_site "AG-GT|0--2"; equivalent_seq ""; circ_type "Unknown"; circ_len "-2"; isoform "76696524-76696522";
2 CIRI-long circRNA 117281827 117281825 5 + . circ_id "2:117281827-117281825"; splice_site "AG-GT|7-5"; equivalent_seq "G"; circ_type "Unknown"; circ_len "-2"; isoform "117281827-117281825";
2 CIRI-long circRNA 121347282 121347280 2 + . circ_id "2:121347282-121347280"; splice_site "AG-GT|10-8"; equivalent_seq ""; circ_type "Unknown"; circ_len "-2"; isoform "121347282-121347280";
2 CIRI-long circRNA 128669829 128669827 5 - . circ_id "2:128669829-128669827"; splice_site "AG-GT|-7--9"; equivalent_seq ""; circ_type "Unknown"; circ_len "-2"; isoform "128669829-128669827";
3 CIRI-long circRNA 89958600 89958598 2 - . circ_id "3:89958600-89958598"; splice_site "AG-GT|5-3"; equivalent_seq ""; circ_type "Unknown"; circ_len "-2"; isoform "89958600-89958598";

I also found in the same file circRNAs with unknown strand and splice_site info, as follows:

grep 'splice_site "None' barcode03.info | head -5

1 CIRI-long circRNA 3215147 3215449 5 None . circ_id "1:3215147-3215449"; splice_site "None"; equivalent_seq ""; circ_type "Unknown"; circ_len "302"; isoform "3215147-3215449";
1 CIRI-long circRNA 9940139 9940778 5 None . circ_id "1:9940139-9940778"; splice_site "None"; equivalent_seq ""; circ_type "Unknown"; circ_len "639"; isoform "9940139-9940778";
1 CIRI-long circRNA 15396359 15396994 7 None . circ_id "1:15396359-15396994"; splice_site "None"; equivalent_seq ""; circ_type "Unknown"; circ_len "635"; isoform "15396359-15396994";
1 CIRI-long circRNA 22552121 22552535 2 None . circ_id "1:22552121-22552535"; splice_site "None"; equivalent_seq "ggg"; circ_type "Unknown"; circ_len "414"; isoform "22552121-22552535";
1 CIRI-long circRNA 32390644 32391226 2 None . circ_id "1:32390644-32391226"; splice_site "None"; equivalent_seq ""; circ_type "Unknown"; circ_len "582"; isoform "32390644-32391226";

Could you be so kind as to explain which situation these circRNAs correspond to and how I should treat them?

Thank you very much!

mmaitenat commented 2 years ago

I am so sorry, I just saw an issue about circRNAs with negative and 0 length, and your recommendation to remove them because they come from erroneous reads. Still, I was wondering whether I should keep those with splice_site="None", or whether these may be errors too.

Thanks!

Kevinzjy commented 2 years ago

Hi, the current version of CIRI-long on GitHub will remove these negative length circRNAs, and I will update the version on PyPI with the next formal release.

splice_site='None' means that no pre-defined splice site could be found in the BSJ region of the CCS reads, so it's hard to tell whether these circRNAs are reverse transcription artifacts or real circRNAs. If you're working with a model species with well-defined splice sites, it's better to filter them out.
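
Until the updated release is out, the filtering can also be done on the .info file afterwards. The sketch below is only an illustration, not CIRI-long code: the function names are hypothetical, and it assumes the attributes are written as key "value"; pairs exactly as in the records quoted above. It drops records with circ_len of 0 or less and, optionally, those with splice_site "None".

```python
import re

# Matches attribute pairs such as: circ_len "302"
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def parse_attributes(line):
    """Return the key "value" pairs of one .info record as a dict."""
    return dict(ATTR_RE.findall(line))


def keep_record(line, drop_unknown_splice_site=True):
    """Decide whether a .info record should be kept."""
    attrs = parse_attributes(line)
    try:
        circ_len = int(attrs.get("circ_len", ""))
    except ValueError:
        return False  # no usable length, treat as unreliable
    if circ_len <= 0:
        return False  # negative or zero length
    if drop_unknown_splice_site and attrs.get("splice_site") == "None":
        return False  # no pre-defined splice site found
    return True


def filter_info(lines, drop_unknown_splice_site=True):
    """Yield only the records that pass keep_record; skip comment lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        if keep_record(line, drop_unknown_splice_site):
            yield line
```

Setting `drop_unknown_splice_site=False` keeps the splice_site "None" records, which may be preferable for species without well-defined splice sites.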