fredhutchio / tfcb_2019

class materials for MCB517A through UW/Fred Hutch
10 stars 10 forks source link

Questions for homework 3 #21

Open k8hertweck opened 4 years ago

k8hertweck commented 4 years ago

Include questions about homework 3 here! @gavinha and I will be happy to help answer them.

LABrumage commented 4 years ago

Is the code for 1a and 1b supposed to go all in the same code block below 1b? Also, for 1a printing out the first five unique TCGA IDs, is that supposed to be drawn from sorting the overlapping segments? Thank you!

k8hertweck commented 4 years ago

@LABrumage Sorry about leaving out a code chunk! Please add one for 1a.

@gavinha, can you speak to this question:

for 1a printing out the first five unique TCGA IDs, is that supposed to be drawn from sorting the overlapping segments?

rcsegura commented 4 years ago

We've (@LABrumage + @sbest0128) been trying to get the Segment_Mean for the copy number segments in problem 1b, but we are stuck. We have the overlaps, but we can't filter out which means are associated with the overlap values - is there anything that we should try doing?

gavinha commented 4 years ago

We've (@LABrumage + @sbest0128) been trying to get the Segment_Mean for the copy number segments in problem 1b, but we are stuck. We have the overlaps, but we can't filter out which means are associated with the overlap values - is there anything that we should try doing?

Hi @rcsegura

Please refer to the R Markdown file from the lecture: https://github.com/fredhutchio/tfcb_2019/blob/b4828fba7e97fc6018d42d0863ba0bd08370e456/lectures/lecture08/Lecture8_GenomicData.Rmd#L165-L166

Let me know if you still have any questions.

Best, Gavin

gavinha commented 4 years ago

@LABrumage Sorry about leaving out a code chunk! Please add one for 1a.

@gavinha, can you speak to this question:

for 1a printing out the first five unique TCGA IDs, is that supposed to be drawn from sorting the overlapping segments?

Yes, after finding the overlapping ranges for segs.gr, use the unique function, followed by the sort function. Then, print out the first 5 rows.

k8hertweck commented 4 years ago

I received the following question via email:

For question 1a, it is asking for overlaps with the regions chr8:128,746,347-128,755,810. Does this mean that for chr8:128, we are only looking for overlap at that specific base pair?

In other words, would that mean that the start and stop ranges would both be 128?

The commas are not separating numbers, but instead represent numbers higher than 128 million

k8hertweck commented 4 years ago

Another question for @gavinha, regarding 1a: after creating a GRanges object representing the overlapping segments, it looks like the TCGA ids aren't present any longer?

LABrumage commented 4 years ago

@gavinha is this on the right track for 1b (where object overlap1b is the GRanges object containing the overlaps I found)? ind.segs.overlap <-subjectHits(overlap1b) segs.gr[ind.segs.overlap] segs.means <- segs.gr[ind.segs.overlap]$Segment_Mean mean(segs.means) I got 0.1747237 as the mean of the segment means

gavinha commented 4 years ago

Another question for @gavinha, regarding 1a: after creating a GRanges object representing the overlapping segments, it looks like the TCGA ids aren't present any longer?

After using findOverlaps, you need to use the indices to query the correct rows from the original segs.gr. See the Lecture notes here: https://github.com/fredhutchio/tfcb_2019/blob/b4828fba7e97fc6018d42d0863ba0bd08370e456/lectures/lecture08/Lecture8_GenomicData.Rmd#L164

Let me know if you have any further questions.

gavinha commented 4 years ago

@gavinha is this on the right track for 1b (where object overlap1b is the GRanges object containing the overlaps I found)? ind.segs.overlap <-subjectHits(overlap1b) segs.gr[ind.segs.overlap] segs.means <- segs.gr[ind.segs.overlap]$Segment_Mean mean(segs.means) I got 0.1747237 as the mean of the segment means

Those specific lines of code look correct. Ifoverlap1b is the output of the findOverlaps function, then it is a Hits object (not GRanges) with the indices of overlap in your query and subject.

Also, make sure you are using the correct function to assign ind.segs.overlap; which should you use subjectHits or queryHits? Make sure you use the one that matches the segs.gr when you called findOverlaps

gavinha commented 4 years ago

Hi class,

In Problem 4d, if you used data.table as described in: https://github.com/fredhutchio/tfcb_2019/blob/b4828fba7e97fc6018d42d0863ba0bd08370e456/lectures/lecture08/Lecture8_GenomicData.Rmd#L244-L246 you may notice that in the R Markdown itself shows something like this:


GT<list> | DP<list> | GQ<list> | ADALL<list> | AD<list> | IGT<list> | IPS<list> | PS<list>
<chr [1]>    <int [1]>    <int [1]>    <int [2]>    <int [2]>    <chr [1]>    <chr [1]>    <chr [1]>
<chr [1]>    <int [1]>    <int [1]>    <int [2]>    <int [2]>    <chr [1]>    <chr [1]>    <chr [1]>
<chr [1]>    <int [1]>    <int [1]>    <int [2]>    <int [2]>    <chr [1]>    <chr [1]>    <chr [1]

Don't worry! Once you Knit, the output will be fine.

Hope this helps. Gavin

k8hertweck commented 4 years ago

I've talked to a few folks who are struggling to conceptualize homework 3. If you haven't worked with genomic data and/or haven't spent a lot of time working in R, it's possible this homework will be particularly challenging. Here are a few things that may help to keep in mind:

A few other more specific answers to questions I've received about genomic data/analyses:

gavinha commented 4 years ago

Thanks @k8hertweck If you also want more information about the Copy Number Segmentation file, see the documentation here: https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/CNV_Pipeline/

zyaffe commented 4 years ago

What VCF file are we supposed to load for problem #4? 'GIAB_highconf_v.3.3.2.vcf.gz' from lecture 7?

k8hertweck commented 4 years ago

What VCF file are we supposed to load for problem #4? 'GIAB_highconf_v.3.3.2.vcf.gz' from lecture 7?

Yes, that's the correct file, @zyaffe

LABrumage commented 4 years ago

For Problem 4d where we have to combine the genotype info into a table, all of the necessary columns appear to be present but the entries in the table are <chr [1]>, <int[1]>, and so forth. Is this what the entries are supposed to read as? I used the vcf object where I loaded info for chromosome 8 (defined back in 4a). Here's my code for 4d: genoData <- data.table(do.call(cbind, geno(vcf))) colnames(genoData) <- rownames(geno(header(vcf))) genoData

alexgal8 commented 4 years ago

Another question for @gavinha, regarding 1a: after creating a GRanges object representing the overlapping segments, it looks like the TCGA ids aren't present any longer?

After using findOverlaps, you need to use the indices to query the correct rows from the original segs.gr. See the Lecture notes here:

https://github.com/fredhutchio/tfcb_2019/blob/b4828fba7e97fc6018d42d0863ba0bd08370e456/lectures/lecture08/Lecture8_GenomicData.Rmd#L164

Let me know if you have any further questions.

When I run this segs.gr[ind.segs.overlap.tile2] after using find0verlaps, I get no results: GRanges object with 0 ranges and 3 metadata columns: seqnames ranges strand | Sample Num_Probes Segment_Mean

| ------- seqinfo: 24 sequences from hg19 genome Do I need to type something in addition to segs.gr[ind.segs.overlap.tile2]? When I try to change the end to a different tile, it won't let me, the only tile that works is 2. Also, how do I identify the TCGA id? Is it labeled?
Amandakr713 commented 4 years ago

For 2d, I found two windows that have the highest count of deletion segments. Should we be looking for only one window as the answer?

gavinha commented 4 years ago

For Problem 4d where we have to combine the genotype info into a table, all of the necessary columns appear to be present but the entries in the table are <chr [1]>, <int[1]>, and so forth. Is this what the entries are supposed to read as? I used the vcf object where I loaded info for chromosome 8 (defined back in 4a). Here's my code for 4d: genoData <- data.table(do.call(cbind, geno(vcf))) colnames(genoData) <- rownames(geno(header(vcf))) genoData

Hi @LABrumage

Please refer to my message 3 days ago about question 4d.

gavinha commented 4 years ago

For 2d, I found two windows that have the highest count of deletion segments. Should we be looking for only one window as the answer?

Hi @Amandakr713

You can return both or just one of them.

gavinha commented 4 years ago

Another question for @gavinha, regarding 1a: after creating a GRanges object representing the overlapping segments, it looks like the TCGA ids aren't present any longer?

After using findOverlaps, you need to use the indices to query the correct rows from the original segs.gr. See the Lecture notes here: https://github.com/fredhutchio/tfcb_2019/blob/b4828fba7e97fc6018d42d0863ba0bd08370e456/lectures/lecture08/Lecture8_GenomicData.Rmd#L164

Let me know if you have any further questions.

When I run this segs.gr[ind.segs.overlap.tile2] after using find0verlaps, I get no results:

GRanges object with 0 ranges and 3 metadata columns: seqnames ranges strand Sample Num_Probes Segment_Mean

seqinfo: 24 sequences from hg19 genome

Do I need to type something in addition to segs.gr[ind.segs.overlap.tile2]? When I try to change the end to a different tile, it won't let me, the only tile that works is 2. Also, how do I identify the TCGA id? Is it labeled?

Hi @alexgal8

The variable that you use to index segs.gr (i.e. what you put between the square brackets). The indices are supposed to refer to the rows in segs.gr that overlap chr8:128746347-128755810. Please refer to the lecture notes on how to find overlaps.

https://github.com/fredhutchio/tfcb_2019/blob/fce205cc278475672b4908e474edd62b61568c3c/lectures/lecture08/Lecture8_GenomicData.Rmd#L142-L150

Amandakr713 commented 4 years ago

In 4e, where do the DNAString and DNAStringSet objects exist?

gavinha commented 4 years ago

In 4e, where do the DNAString and DNAStringSet objects exist?

Hi @Amandakr713

In the lecture notes: https://github.com/fredhutchio/tfcb_2019/blob/fce205cc278475672b4908e474edd62b61568c3c/lectures/lecture08/Lecture8_GenomicData.Rmd#L220-L223

Find the column referring to the reference base REF and alternate base ALT in this object. To access the columns, use rowRanges(vcf)$REF.

alexgal8 commented 4 years ago

Another question for @gavinha, regarding 1a: after creating a GRanges object representing the overlapping segments, it looks like the TCGA ids aren't present any longer?

After using findOverlaps, you need to use the indices to query the correct rows from the original segs.gr. See the Lecture notes here: https://github.com/fredhutchio/tfcb_2019/blob/b4828fba7e97fc6018d42d0863ba0bd08370e456/lectures/lecture08/Lecture8_GenomicData.Rmd#L164

Let me know if you have any further questions.

When I run this segs.gr[ind.segs.overlap.tile2] after using find0verlaps, I get no results:

GRanges object with 0 ranges and 3 metadata columns: seqnames ranges strand Sample Num_Probes Segment_Mean

seqinfo: 24 sequences from hg19 genome Do I need to type something in addition to segs.gr[ind.segs.overlap.tile2]? When I try to change the end to a different tile, it won't let me, the only tile that works is 2. Also, how do I identify the TCGA id? Is it labeled?

Hi @alexgal8

The variable that you use to index segs.gr (i.e. what you put between the square brackets). The indices are supposed to refer to the rows in segs.gr that overlap chr8:128746347-128755810. Please refer to the lecture notes on how to find overlaps.

https://github.com/fredhutchio/tfcb_2019/blob/fce205cc278475672b4908e474edd62b61568c3c/lectures/lecture08/Lecture8_GenomicData.Rmd#L142-L150

Yes, I've got the overlaps but now I'm trying to query the correct rows from the original 'segs.gr'.

Amandakr713 commented 4 years ago

Thank you @gavinha ! For 4f, some of the REF and ALT columns contain multiple bases (ex: GCA). Did I execute the code incorrectly? I thought we should only expect single bases in these columns.

gavinha commented 4 years ago

Thank you @gavinha ! For 4f, some of the REF and ALT columns contain multiple bases (ex: GCA). Did I execute the code incorrectly? I thought we should only expect single bases in these columns.

Hi @Amandakr713 Some of the alleles do contain multiple bases representing short, multi-base INDELs.

Thank you for bringing up this concern. I just realized that this question is an older version that is much more challenging. I had come up with a slightly different version but did not manage to upload the changes. I will talk to Kate about this but likely, all of 4f will be a question only for bonus marks.

For your reference, the more straight-forward version of the question is simply a change in the region of interest to chr8:129127000-129128000. This region only contains 5 SNPs and the alleles are single bases. Manually writing out the answer would be much easier, as you can tell.

Reporting the answer with these 5 SNPs in chr8:129127000-129128000 will also get full marks.

I not sure if other students actually come to GitHub to read these comments so Kate will send an announcement about this.

Thanks for bringing this to my attention and sorry for the inconvenience.

alexgal8 commented 4 years ago

@gavinha In regard to problem 2 and finding copy number alterations, can you direct me to the part in the lecture that addresses this? I remember you going over it but now I can't find the section in the lecture materials. Thanks!

gavinha commented 4 years ago

Another question for @gavinha, regarding 1a: after creating a GRanges object representing the overlapping segments, it looks like the TCGA ids aren't present any longer?

After using findOverlaps, you need to use the indices to query the correct rows from the original segs.gr. See the Lecture notes here: https://github.com/fredhutchio/tfcb_2019/blob/b4828fba7e97fc6018d42d0863ba0bd08370e456/lectures/lecture08/Lecture8_GenomicData.Rmd#L164

Let me know if you have any further questions.

When I run this segs.gr[ind.segs.overlap.tile2] after using find0verlaps, I get no results:

GRanges object with 0 ranges and 3 metadata columns: seqnames ranges strand Sample Num_Probes Segment_Mean

seqinfo: 24 sequences from hg19 genome Do I need to type something in addition to segs.gr[ind.segs.overlap.tile2]? When I try to change the end to a different tile, it won't let me, the only tile that works is 2. Also, how do I identify the TCGA id? Is it labeled?

Hi @alexgal8 The variable that you use to index segs.gr (i.e. what you put between the square brackets). The indices are supposed to refer to the rows in segs.gr that overlap chr8:128746347-128755810. Please refer to the lecture notes on how to find overlaps. https://github.com/fredhutchio/tfcb_2019/blob/fce205cc278475672b4908e474edd62b61568c3c/lectures/lecture08/Lecture8_GenomicData.Rmd#L142-L150

Yes, I've got the overlaps but now I'm trying to query the correct rows from the original 'segs.gr'.

Assuming you used something like hits <- findOverlaps(query = ..., subjects = ...), then you can find the overlapping indices for segs.gr using either queryHits(hits) or subjectHits(hits) depending on whether segs.gr is the query or the subject in the findOverlaps command.

gavinha commented 4 years ago

@gavinha In regard to problem 2 and finding copy number alterations, can you direct me to the part in the lecture that addresses this? I remember you going over it but now I can't find the section in the lecture materials. Thanks!

@alexgal8 Please refer to Section 3 of the lecture notes: https://github.com/fredhutchio/tfcb_2019/blob/fce205cc278475672b4908e474edd62b61568c3c/lectures/lecture08/Lecture8_GenomicData.Rmd#L121

You need to make use of basic operations/functionality of Granges objects to answer the questions. There is nothing specific about copy number alterations for Problem 2. In fact, you are still using the same copy number alteration segments from Problem 1.

k8hertweck commented 4 years ago

Please see this corrected language for question 4f:

f. Find the phased SNPs for both haplotypes within the region chr8: 129,127,000-129,128,000. (4 points)

Given the lateness of this change, and difficulty in communicating updates, all answers to question 4f will be counted as extra credit, and all reasonable genomic regions will be acceptable for receiving at least partial extra credit for this question.

zyaffe commented 4 years ago

For 2c-e, are we using the same tiles in chromosome X that we made in 2b or should we be using separate tiling over chromosomes 1-22 + X instead?

k8hertweck commented 4 years ago

@zyaffe The question intended for all chromosomes, but we'll accept either interpretation.

alexgal8 commented 4 years ago

For 2d, I have the list of the overlaps and segment means but I am stuck on what to do next. How do I group ranges by tile? How to I sort by < -0.3? (I am still trying to learn R basics and am struggling to find everything I need to include with sort() to get it to work with these data).

gavinha commented 4 years ago

For 2d, I have the list of the overlaps and segment means but I am stuck on what to do next. How do I group ranges by tile? How to I sort by < -0.3? (I am still trying to learn R basics and am struggling to find everything I need to include with sort() to get it to work with these data).

During the lecture, I showed an example (not in the notes) on how to select segments from a GRanges object based on a condition applied to the Segment_Mean. Specifically, I showed how you would find deletions. To get you started:

## returns indices (i.e. row numbers) in segs.gr which have Segment_Mean less than -0.3 
ind <- which(segs.gr$Segment_Mean < -0.3)   
## look at the rows in segs.gr that have Segment_Mean < -0.3
segs.gr[ind] 

Once you have this, then perform the overlap.

I am at Fred Hutch today until 5pm and available for office hours tomorrow 9:30-11am. My office is M1-B869 in the Arnold Building.

yliu234 commented 4 years ago

Hi @gavinha and @k8hertweck, for the bonus question (2 points): provide code to determine the two haplotype sequences, can you elaborate on what the question is looking for? How does it differ from question 4f?

gavinha commented 4 years ago

Hi @yliu234

The question asks you to print out the sequence of alleles for each of the two haplotypes. You can simply read it off the screen and type it out manually. You can also write code to extract the alleles automatically from the VCF object; this code is how you would generalize to analyzing any region in the genome.

Hope this makes sense. Gavin