Linlab-slu / TSSr

Other
9 stars 3 forks source link

Is it mandatory to use a BSgenome file as input? #18

Closed JirkaMP closed 4 months ago

JirkaMP commented 5 months ago

Greetings, I hope you are doing well! I was trying to use TSSr in R, but I stumbled across an issue when trying to forge a BSgenome of my virus from NCBI genome, which I want to analyse. Basically, the NCBI submission is faulty and I cant generate a BSgenome without asking the submitters to check their submission again... (the BSgenome forging works for basically every other virus I tested, just not for mine ... ) So my question, does TSSr only take BSgenomes as input, or could I for example also only use a fast file? My virus has a single circular dsDNA genome of around 228 kb size.

Here the TSSr command I intended to use;

myTSSr <- new("TSSr", genomeName = genomeBS, inputFiles = inputfiles, inputFilesType = inputfilestype, sampleLabels = c("B1", "B2", "B3"), sampleLabelsMerged = c("9hpi"), refSource = refsource, organismName = "Heliothis zea nudivirus 1")

Thank you in advance!

Cheers Jirka

Linlab-slu commented 5 months ago

Hi Jirka,

TSSr requires BSgenome to call TSS from mapped read data and several other functions. I am sorry that TSSr does not work without providing the corresponding BSgenome package.

Best,

Zhenguo

[cid:701fc0bf-e09d-4a27-bc05-add3cf2ce24c]http://www.zlinlab.org/


From: JirkaMP @.> Sent: Friday, April 19, 2024 2:12 AM To: Linlab-slu/TSSr @.> Cc: Subscribed @.***> Subject: [External] [Linlab-slu/TSSr] Is it mandatory to use a BSgenome file as input? (Issue #18)

CAUTION: This email is from an external source. Do not click links or open attachments unless you recognize the sender and know the content is safe. When in doubt, please report the email to SLUAware by clicking on the three dots within your email and choosing Report To SLUAware.

Greetings, I hope you are doing well! I was trying to use TSSr in R, but I stumbled across an issue when trying to forge a BSgenome of my virus from NCBI genome, which I want to analyse. Basically, the NCBI submission is faulty and I cant generate a BSgenome without asking the submitters to check their submission again... (the BSgenome forging works for basically every other virus I tested, just not for mine ... ) So my question, does TSSr only take BSgenomes as input, or could I for example also only use a fast file? My virus has a single circular dsDNA genome of around 228 kb size.

Here the TSSr command I intended to use;

myTSSr <- new("TSSr", genomeName = genomeBS, inputFiles = inputfiles, inputFilesType = inputfilestype, sampleLabels = c("B1", "B2", "B3"), sampleLabelsMerged = c("9hpi"), refSource = refsource, organismName = "Heliothis zea nudivirus 1")

Thank you in advance!

Cheers Jirka

— Reply to this email directly, view it on GitHubhttps://urldefense.com/v3/__https://github.com/Linlab-slu/TSSr/issues/18__;!!K543PA!LNnJVgVJEqiG1OjGuo1KRUtIx5iEFqjcmvS1iAlLjT6la1nMgRBFFs-vfSsUjG1cUdXbVHgNrP6SFhi54FQ2nvkNONU$, or unsubscribehttps://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/ALOYBZRJXGTLOTUOI7G5WYLY6C7XJAVCNFSM6AAAAABGOUFA3OVHI2DSMVQWIX3LMV43ASLTON2WKOZSGI2TEMRVGAZDCNY__;!!K543PA!LNnJVgVJEqiG1OjGuo1KRUtIx5iEFqjcmvS1iAlLjT6la1nMgRBFFs-vfSsUjG1cUdXbVHgNrP6SFhi54FQ2J5SylJk$. You are receiving this because you are subscribed to this thread.Message ID: @.***>

JirkaMP commented 5 months ago

Dear Zhenguo, I managed to forge a BSgenome and run the TSSr tool up the point of creating consensus clusters. However, I'm not sure how to interpret the results, here an excerpt from the extracted table:

cluster chr start end strand dominant_tss tags tags.dominant_tss q_0.1 q_0.9 interquantile_width 254 AF451898.1 67163 67191 + 67185 425.051366 150.824676 67163 67191 29 255 AF451898.1 67325 67347 + 67347 191.958682 123.402009 67325 67347 23 256 AF451898.1 67434 67604 + 67497 1974.432144 164.536011 67444 67547 104 265 AF451898.1 67852 69866 + 69362 164810.238006 1275.154088 68146 69650 1505 269 AF451898.1 70595 70595 + 70595 27.422669 27.422669 70595 70595 1 270 AF451898.1 71295 71295 + 71295 41.134004 41.134004 71295 71295 1 271 AF451898.1 72171 72171 + 72171 27.422669 27.422669 72171 72171 1 272 AF451898.1 72475 72475 + 72475 41.134004 41.134004 72475 72475 1 273 AF451898.1 74502 74502 + 74502 27.422669 27.422669 74502 74502 1 274 AF451898.1 74609 74636 + 74635 123.402012 54.845339 74609 74636 28 275 AF451898.1 74846 74846 + 74846 54.845339 54.845339 74846 74846 1 276 AF451898.1 74953 75772 + 75695 20169.372753 863.814058 75067 75692 626

I presume cluster 265 with dominant_TSS = 164810.238006 would then be considered a TSS at position 69362bp of the genome right? However, that position would be quite far inside the coding sequence of the gene, which you can infer when visualizing the BAM files I used as input in Geneious (see image). Rather I would have presumed that the TSS is somewhere around 70,180bp for orf51 and around 70,300bp for orf52, but that's just my subjective assessment. I know this example of a dsDNA virus might feature some novelties for the package, but I was wondering if there are maybe ways/parameters that could be adapted in TSSr?

orf51 and orf52

Please see my code below:

myTSSr <- new("TSSr", genomeName = "BSgenome.Hgenome.NCBI.ASM282678v1", inputFiles = inputfiles, inputFilesType = inputfilestype, sampleLabels = c("3hpi", "3hpi", "3hpi", "6hpi", "6hpi", "6hpi", "9hpi", "9hpi", "9hpi"), sampleLabelsMerged = c("3hpi", "6hpi", "9hpi"), refSource = refsource, organismName = "Heliothis zea nudivirus 1")

results <- getTSS(myTSSr)

results_merged <- mergeSamples(results, mergeIndex = c(1,1,1,2,2,2,3,3,3)) #merge biological replicates

results1 <- filterTSS(results_merged, method = "poisson", pVal = 0.01, normalization = TRUE)

results_clustered <- clusterTSS(results1, method = "peakclu",peakDistance=100,extensionDistance=30 ,localThreshold = 0.02, clusterThreshold = 1 ,useMultiCore=FALSE, numCores=NULL)

results_cc <- consensusCluster(results_clustered, dis = 50, useMultiCore = FALSE, numCores = NULL)

Lastly I exported the tables

Best Jirka

Linlab-slu commented 5 months ago

Hi Jirka,

How to assign a TSS cluster (TC) to its downstream gene replies on the parameters you used "annotateCluster". The default is "upstream=1000", which means that TSSr is assigned a TC if its dominant TSS is within 1000 bp upstream of the annotated gene ORF. You may change ""upstream" parament to a smaller value if 1000 is too long for the virus genome.

Another way to inspect TC assignment is to export TC data to BW file and visualize them in IGV or other genome browsers (exportTSStoBedgraph(myTSSr, data = "processed", format = "BigWig")).

I hope it helps.

Best,

Zhenguo

[cid:298d0140-01a4-4a3a-87bf-540d6c84426f]http://www.zlinlab.org/


From: JirkaMP @.> Sent: Sunday, April 21, 2024 7:13 AM To: Linlab-slu/TSSr @.> Cc: Zhenguo Lin @.>; Comment @.> Subject: [External] Re: [Linlab-slu/TSSr] Is it mandatory to use a BSgenome file as input? (Issue #18)

CAUTION: This email is from an external source. Do not click links or open attachments unless you recognize the sender and know the content is safe. When in doubt, please report the email to SLUAware by clicking on the three dots within your email and choosing Report To SLUAware.

Dear Zhenguo, I managed to forge a BSgenome and run the TSSr tool up the point of creating consensus clusters. However, I'm not sure how to interpret the results, here an excerpt from the extracted table:

cluster chr start end strand dominant_tss tags tags.dominant_tss q_0.1 q_0.9 interquantile_width 254 AF451898.1 67163 67191 + 67185 425.051366 150.824676 67163 67191 29 255 AF451898.1 67325 67347 + 67347 191.958682 123.402009 67325 67347 23 256 AF451898.1 67434 67604 + 67497 1974.432144 164.536011 67444 67547 104 265 AF451898.1 67852 69866 + 69362 164810.238006 1275.154088 68146 69650 1505 269 AF451898.1 70595 70595 + 70595 27.422669 27.422669 70595 70595 1 270 AF451898.1 71295 71295 + 71295 41.134004 41.134004 71295 71295 1 271 AF451898.1 72171 72171 + 72171 27.422669 27.422669 72171 72171 1 272 AF451898.1 72475 72475 + 72475 41.134004 41.134004 72475 72475 1 273 AF451898.1 74502 74502 + 74502 27.422669 27.422669 74502 74502 1 274 AF451898.1 74609 74636 + 74635 123.402012 54.845339 74609 74636 28 275 AF451898.1 74846 74846 + 74846 54.845339 54.845339 74846 74846 1 276 AF451898.1 74953 75772 + 75695 20169.372753 863.814058 75067 75692 626

I presume cluster 265 with dominant_TSS = 164810.238006 would then be considered a TSS at position 69362bp of the genome right? However, that position would be quite far inside the coding sequence of the gene, which you can infer when visualizing the BAM files I used as input in Geneious (see image). Rather I would have presumed that the TSS is somewhere around 70,180bp for orf51 and around 70,300bp for orf52, but that's just my subjective assessment. I know this example of a dsDNA virus might feature some novelties for the package, but I was wondering if there are maybe ways/parameters that could be adapted in TSSr?

orf51.and.orf52.PNG (view on web)https://urldefense.com/v3/__https://github.com/Linlab-slu/TSSr/assets/166142670/9c4f1c73-bc9d-4778-a624-6fd0cd51805e__;!!K543PA!ISkGR4dZcRCL23HiJaXx0QyEG7j0dikwkdoKFZbs5Ejfhv8bObAzCM-QnXr_tU8UZNfNmixafQblOnJgRwTwUlQorp4$

Please see my code below:

myTSSr <- new("TSSr", genomeName = "BSgenome.Hgenome.NCBI.ASM282678v1", inputFiles = inputfiles, inputFilesType = inputfilestype, sampleLabels = c("3hpi", "3hpi", "3hpi", "6hpi", "6hpi", "6hpi", "9hpi", "9hpi", "9hpi"), sampleLabelsMerged = c("3hpi", "6hpi", "9hpi"), refSource = refsource, organismName = "Heliothis zea nudivirus 1")

results <- getTSS(myTSSr)

results_merged <- mergeSamples(results, mergeIndex = c(1,1,1,2,2,2,3,3,3)) #merge biological replicates

results1 <- filterTSS(results_merged, method = "poisson", pVal = 0.01, normalization = TRUE)

results_clustered <- clusterTSS(results1, method = "peakclu",peakDistance=100,extensionDistance=30 ,localThreshold = 0.02, clusterThreshold = 1 ,useMultiCore=FALSE, numCores=NULL)

results_cc <- consensusCluster(results_clustered, dis = 50, useMultiCore = FALSE, numCores = NULL)

Lastly I exported the tables

Best Jirka

— Reply to this email directly, view it on GitHubhttps://urldefense.com/v3/__https://github.com/Linlab-slu/TSSr/issues/18*issuecomment-2068022137__;Iw!!K543PA!ISkGR4dZcRCL23HiJaXx0QyEG7j0dikwkdoKFZbs5Ejfhv8bObAzCM-QnXr_tU8UZNfNmixafQblOnJgRwTwSdLL5NY$, or unsubscribehttps://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/ALOYBZS2CYCB4MGNTVI66DTY6OUNLAVCNFSM6AAAAABGOUFA3OVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANRYGAZDEMJTG4__;!!K543PA!ISkGR4dZcRCL23HiJaXx0QyEG7j0dikwkdoKFZbs5Ejfhv8bObAzCM-QnXr_tU8UZNfNmixafQblOnJgRwTw702wuJo$. You are receiving this because you commented.Message ID: @.***>

JirkaMP commented 5 months ago

Hey Zhenguo, thank you for your remarks, it was very helpful for me to change the upstream parameter! I've been playing around a bit with the different parameters, and there is still some fine-tuning required. Maybe you can once again offer me your insight?

I changed the code of the last three operations to this now:

results_clustered <- clusterTSS(results1, method = "peakclu",peakDistance=50,extensionDistance=0 ,localThreshold = 0.02, clusterThreshold = 1 ,useMultiCore=FALSE, numCores=NULL)

results_cc <- consensusCluster(results_clustered, dis = 50, useMultiCore = FALSE, numCores = NULL)

results_acc <- annotateCluster(results_cc,clusters = "consensusClusters",filterCluster = TRUE, filterClusterThreshold = 0.02, annotationType = "genes" ,upstream=200, upstreamOverlap =20, downstream = 0)

While I've now got genes that match with the read coverage that I see in Geneious (see values and picture for example of orf97, where the TSS at 139752bp is what I would also expect from the coverages in Geneious)...

cluster chr start end strand dominant_tss tags tags.dominant_tss q_0.1 q_0.9 interquantile_width gene inCoding 949 AF451898.1 139752 139752 + 139752 5.341887 5.341887 139752 139752 1 orf97 NA
orf96 orf97

... there are other cases where two TSSs are determined for a gene, one that matches the coverage and an undesirable one that represents the coverage of a neighbouring gene (for example orf59, where 84311bp would be the TSS for orf59 that I would expect based on the coverage, and the one at 84450bp are reads related to orf60, so undesired...)

1862 AF451898.1 84311 84311 - 84311 4.006416 4.006416 84311 84311 1 orf59 NA 1863 AF451898.1 84450 84450 - 84450 4.674152 4.674152 84450 84450 1 orf59 NA orf59 orf60

Do you have any suggestions about what parameters could be modified or made more strict to eliminate false-positive TSSs, such as those in orf59?

Thanks for taking the time to read my comment!

Cheers Jirka

Linlab-slu commented 5 months ago

Hi Jirka,

Based on your screenshot, orf59 and orf60 are on different strands. In TSSr, TSS clusters can only be assigned to its downstream gene on the same strand. If your sequencing data is strand-specific, and TSS cluster 1863 (84450) is on the forward strand, it should be assigned to orf60. If your sequencing data is not strand-specific, you probably need to manual adjust gene assignment based on actual gene orientation. According to read mapping screenshot, these reads look like regular RNA-seq reads, instead of 5'end sequencing reads (e.g. CAGE, TSS-seq). If they are NRA-seq reads, you can use them to infer approximate locations of TSSs, but it is not very accurate .

Best,

Zhenguo

[cid:285a6762-2aa4-4961-a674-1d091a1a717a]http://www.zlinlab.org/


From: JirkaMP @.> Sent: Tuesday, April 23, 2024 3:18 PM To: Linlab-slu/TSSr @.> Cc: Zhenguo Lin @.>; Comment @.> Subject: [External] Re: [Linlab-slu/TSSr] Is it mandatory to use a BSgenome file as input? (Issue #18)

CAUTION: This email is from an external source. Do not click links or open attachments unless you recognize the sender and know the content is safe. When in doubt, please report the email to SLUAware by clicking on the three dots within your email and choosing Report To SLUAware.

Hey Zhenguo, thank you for your remarks, it was very helpful for me to change the upstream parameter! I've been playing around a bit with the different parameters, and there is still some fine-tuning required. Maybe you can once again offer me your insight?

I changed the code of the last three operations to this now:

results_clustered <- clusterTSS(results1, method = "peakclu",peakDistance=50,extensionDistance=0 ,localThreshold = 0.02, clusterThreshold = 1 ,useMultiCore=FALSE, numCores=NULL)

results_cc <- consensusCluster(results_clustered, dis = 50, useMultiCore = FALSE, numCores = NULL)

results_acc <- annotateCluster(results_cc,clusters = "consensusClusters",filterCluster = TRUE, filterClusterThreshold = 0.02, annotationType = "genes" ,upstream=200, upstreamOverlap =20, downstream = 0)

While I've now got genes that match with the read coverage that I see in Geneious (see values and picture for example of orf97, where the TSS at 139752bp is what I would also expect from the coverages in Geneious)...

cluster chr start end strand dominant_tss tags tags.dominant_tss q_0.1 q_0.9 interquantile_width gene inCoding 949 AF451898.1 139752 139752 + 139752 5.341887 5.341887 139752 139752 1 orf97 NA orf96.orf97.PNG (view on web)https://urldefense.com/v3/__https://github.com/Linlab-slu/TSSr/assets/166142670/83652322-1e6f-4714-b1a3-3237d1a9d8a6__;!!K543PA!Mkz4B708iqL8esLvYHKH-MPGh7VuqR01-P0jwnv0F-WB8ofky9rBbJL38AjaDSNTi4nVfMSyq840A-qWJ6i0nEoEUbo$

... there are other cases where two TSSs are determined for a gene, one that matches the coverage and an undesirable one that represents the coverage of a neighbouring gene (for example orf59, where 84311bp would be the TSS for orf59 that I would expect based on the coverage, and the one at 84450bp are reads related to orf60, so undesired...)

1862 AF451898.1 84311 84311 - 84311 4.006416 4.006416 84311 84311 1 orf59 NA 1863 AF451898.1 84450 84450 - 84450 4.674152 4.674152 84450 84450 1 orf59 NA orf59.orf60.PNG (view on web)https://urldefense.com/v3/__https://github.com/Linlab-slu/TSSr/assets/166142670/f251f95b-6b6e-45c7-abc6-1934cb0770a9__;!!K543PA!Mkz4B708iqL8esLvYHKH-MPGh7VuqR01-P0jwnv0F-WB8ofky9rBbJL38AjaDSNTi4nVfMSyq840A-qWJ6i0djEj8HM$

Do you have any suggestions about what parameters could be modified or made more strict to eliminate false-positive TSSs, such as those in orf59?

Thanks for taking the time to read my comment!

Cheers Jirka

— Reply to this email directly, view it on GitHubhttps://urldefense.com/v3/__https://github.com/Linlab-slu/TSSr/issues/18*issuecomment-2073372346__;Iw!!K543PA!Mkz4B708iqL8esLvYHKH-MPGh7VuqR01-P0jwnv0F-WB8ofky9rBbJL38AjaDSNTi4nVfMSyq840A-qWJ6i0mtWvs6I$, or unsubscribehttps://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/ALOYBZV2AWU7IA3Z6INQTELY6262LAVCNFSM6AAAAABGOUFA3OVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANZTGM3TEMZUGY__;!!K543PA!Mkz4B708iqL8esLvYHKH-MPGh7VuqR01-P0jwnv0F-WB8ofky9rBbJL38AjaDSNTi4nVfMSyq840A-qWJ6i0U1aB1p4$. You are receiving this because you commented.Message ID: @.***>

JirkaMP commented 5 months ago

Hey Zhenguo, aah I see, so TSSr was designed to only consider unidirectional sequencing, instead of birectional? But then, I'm wondering about this one example here, where TSSr quite accurately (I would say) inferred a matching TSS for orf79 (117646) which is in minus direction (in contrast to e.g. orf):

2061 AF451898.1 117646 117646 - 117646 5.341887 5.341887 117646 117646 1 orf79 NA orf79 and orf81

Or am I misinterpreting something here?

Also, my data is indeed Illumina RNAseq data (stranded library fr-firststrand). I already feared that the accuracy of my data wouldn't be comparable to that of CAGE or TSS-seq, but I was still wondering if it can at least help to support the localization of a putative viral promoter we identified with MEME suite (see the light green square in the figures above between the gene annotations), simply by showing that there is a significant increase in coverage in proximity to that motif.

Sorry for making this issue so laborious, but the topic is also quite intruiging for me.

Cheers Jirka