Samples without mutations can only be specified as "sequenced" if they contain other data (e.g. CNA)

oplantalech commented 4 years ago

According to cBioPortal/cbioportal#6934:

The fact that the samples are in the sequenced case list does not turn mutation counts from NA to 0.
The only way to change NA to 0 is to add a comment line in the first line of the maf file with #sequenced_samples: followed by the sample IDs separated by spaces AND that one of the samples that does not contain mutations should have another data (e.g. CNA).

There must be a way to indicate that samples without mutations are sequenced if e.g. the study only contains mutation data.

jjgao commented 4 years ago

the case list _sequenced should include all sequenced samples, but maybe the missing step here is add all samples in _sequenced into sample_profile table?

n1zea144 commented 4 years ago

@jjgao @dippindots I was thinking more about this and our brief discussion during sprint planning about moving towards dynamic case list generation. If we do move to dynamic caselist generation, I think we do want to maintain the #sequenced_samples header. If not, how else can we inform our system as to the sequenced sample list. Thoughts?

jjgao commented 4 years ago

@n1zea144 agreed. We could also support cases_sequences.txt as a special case list to populate sample_profile table.

n1zea144 commented 4 years ago

I'm leaning towards the #sequenced_samples header, for similar reasons we originally identified - its makes a MAF self describing. That is, if the sequenced sample list lives in a separate case list, then both files need to be distributed together, which is clumsy and not very convenient, for say R users.

jjgao commented 4 years ago

@n1zea144 sounds good to me. If we get rid of cases_sequence.txt, we should make sure all info is captured - we might have studies with no #sequenced_samples header but more samples in cases_sequence.txt than in MAF.

n1zea144 commented 4 years ago

I think we should fix the code so that cases_sequence.txt is considered to address this bug, as you point out below. I wouldn't touch the code which supports #sequenced_samples for reasons discussed above. We should have a separate, concerted effort to remove the case-list requirement.

the case list _sequenced should include all sequenced samples, but maybe the missing step here is add all samples in _sequenced into sample_profile table?

Sjoerd-van-Hagen commented 4 years ago

@n1zea144 I am not sure I agree with you on this one. Putting this in the MAF file has drawbacks too, e.g. case lists are a known mechanism that is documented and more widely applicable then just sequenced. The line in the MAF file is not well known in the community. Making the MAF selfdescribing does not help that much, as you will always need additional files before you can load a study.

For generating case lists the logic would simply be: if case list does not exist generate it.

Another issue is that this discussion is getting disconnect from the initial bug report. Whether you generate the case list automatically or not, cBioPortal should look at the case lists to decide if a sample has been sequenced and no mutations are found, vs not being sequenced. This seems like a bug that should be solved regardless. I see @inodb removed this from the sprint. Can you indicate when you expect you can work on this? We may also be able to work on it, but then it would be good to know if you already made some progress. Let me know.

inodb commented 4 years ago

@dippindots any comments on this ☝️ ? I forget why we removed this from the sprint last week. It doesn't look like it ever went to the progress column, so I don't think any implementation work has been done yet. @Sjoerd-van-Hagen if u could work on this that would be awesome

dippindots commented 4 years ago

@inodb @Sjoerd-van-Hagen I remember the reason we removed this from sprint backlog is we haven't come up with a good solution yet, so there is no progress yet.

Sjoerd-van-Hagen commented 4 years ago

Ok. I will wait for @n1zea144 to respond, and see if we can pick this up.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

inodb commented 4 years ago

Any more thoughts on this?

Sjoerd-van-Hagen commented 4 years ago

I think we need to discuss this. @n1zea144 favors having a line in the MAF file, while I think using case lists is better. The implementation of either should not be that difficult as soon as we reach a decision.

jjgao commented 4 years ago

I also think it's better to have it in the MAF file so that it is self-contained. The connection between case list and MAF is not explicit. And what if we have two MAFs in the future?

Sjoerd-van-Hagen commented 4 years ago

I really have a hard time understanding why we have case lists for each and every data type, CNA, expression, methylation, but you want to deviate for mutations.

Especially when you have multiple MAF files it will just be messy if you would include the case lists in the MAF. If you want to know if any particular sample is sequenced you will have to check all the MAFs. The case list is just the case list, a simple file, the same for all data types, that contains the profiled samples. Simple and consistent.

jjgao commented 4 years ago

@Sjoerd-van-Hagen I think we are talking about the same thing here... I don't think case list is well designed for representing the sample list that are profiled and we would like to redesign (remove) case list at certain point as discussed in this RFC.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

jjgao commented 3 years ago

moving it to icebox for now. Let's resolve this when working of rfc37

Sjoerd-van-Hagen commented 3 years ago

But that RFC is ancient...

--

E. sjoerd@thehyve.nl

T. +31(0)30 700 9713

W. www.thehyve.nl

On Wed, Nov 25, 2020 at 4:03 PM JianJiong Gao notifications@github.com wrote:

moving it to icebox for now. Let's resolve this when working of rfc37 https://docs.google.com/document/d/1aBbkTAFv5nCqBv66BOgwvlt5kymgt0lQ7l_pqaFrupc/edit#heading=h.l4kworqrwb2g

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cBioPortal/icebox/issues/118#issuecomment-733760332, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACMGKNN5DYQ54URBDVY2RO3SRUMCNANCNFSM4UCQMDLA .

cBioPortal / icebox

Samples without mutations can only be specified as "sequenced" if they contain other data (e.g. CNA) #118