Additional FASTQ header formats for possible inclusion.

ppavlidis commented 3 years ago

We need more examples of these, or some external documentation, to decipher them.

GSE111979 has a header format I have not seen before, @arteymix do you know it? The device seems to be in the fifth field?

TAAATGTT:@NS500264:224:HWJ37BGXY:1:11105:12161:2145:GTCGATGAGACTAGCCATCGCATTGCCACCACTACCTCTGAGCTGAAGAGTGAACGTAAATGTTGACT length=75

As noted in https://github.com/PavlidisLab/GemmaCuration/issues/145

Originally posted by @ppavlidis in https://github.com/PavlidisLab/Gemma/issues/97#issuecomment-880258520

ppavlidis commented 3 years ago

From @sanjarogic : I am adding another example here: GSE153823. I am not sure if we've seen this format before, but it got "no batch info" badge in Gemma.

Here are the first few headers:

@SRR12150087.1.1 D00408:130:1:1101:1455:2201 length=50 @SRR12150089.1.1 D00408:130:1:1101:1929:2244#CGCTCATT length=50 @SRR12150094.1.1 D00408:130:1:1101:1340:2234#ATTCAGAA length=50 @SRR12150081.1.1 D00408:130:1:1101:2051:2206 length=50 @SRR12150097.1.1 D00408:130:1:1101:1521:2216 length=50 @SRR12150078.1.1 D00408:130:1:1101:1599:2216 length=50 @SRR12150098.1.1 D00408:130:1:1101:1800:2205 length=50

sanjarogic commented 3 years ago

There are a bunch of examples listed in the RNAseq spreadsheet. If @arteymix is going to be reviewing them he can find them there (batch info = wrong).

ppavlidis commented 3 years ago

Where is this spreadsheet?

arteymix commented 3 years ago

I'll take a look at the general structure of the 6-fields batch info. But like I mentioned in https://github.com/PavlidisLab/Gemma/issues/97#issuecomment-874985853, this is a tiny fraction of the datasets.

ppavlidis commented 3 years ago

Yes it's a tiny fraction, but these are data sets we decided to load into Gemma, so we should complete their curation. It's not the most pressing priority but these will be stuck in curation limbo until it is addressed.

sanjarogic commented 3 years ago

Where is this spreadsheet?

https://docs.google.com/spreadsheets/d/17xm2eFFqhhT-M6-jTC_lsar7RMgk8Ln-TwQPDWlRfIY/edit#gid=2128359689

arteymix commented 3 years ago

We should have a "non-standard batch info" tag and call it a day for those datasets.

sanjarogic commented 3 years ago

I don't think that the fraction is tiny, I am seeing more of them lately.

arteymix commented 3 years ago

Okay, I'll recompile the statistics and take a second look.

ppavlidis commented 3 years ago

It might be non-standard, but I keep getting asked to deal with them. I agree it's not "wrong" - our software is doing the exact right thing at the moment, these are not supported. Yet. But I want to get clarity on whether we can support them. Let's quickly determine that so I can move on to other things.

sanjarogic commented 3 years ago

Don't pay much attention to the label "wrong" - it was just meant to label experiments that have fastq headers which are not recognized by the pipeline

arteymix commented 3 years ago

Looking at: datasets-with-6-fields.txt

All I can say, really, is that the first fields contains a device name, and sometimes a device name and a lane (separated by an underscore). The second field sometimes contains a flowcell, sometime a run number. The 3rd seems to contains the lane number, but it is sometimes inverted with the 2nd one.

The 4th, 5th and 6th are useless and seem to contain x/y positions in the lane or a tag, or a run number.

There's just no consistent pattern, except for the device.

sanjarogic commented 3 years ago

I just reviewed these two experiments:

GSE179370 ("no batch effect" badge):

@SRR15031682.1.1 A00977:146:HH7HHDSXY:3:1101:2754:1000 length=101 @SRR15031682.1.2 A00977:146:HH7HHDSXY:3:1101:2754:1000 length=100 @SRR15031679.1.1 A00977:146:HH7HHDSXY:3:1101:1217:1000 length=86 @SRR15031679.1.2 A00977:146:HH7HHDSXY:3:1101:1217:1000 length=86 @SRR15031685.1.1 A00977:290:H7TG7DSX2:2:1101:4598:1000 length=101 @SRR15031685.1.2 A00977:290:H7TG7DSX2:2:1101:4598:1000 length=101 @SRR15031678.1.1 A00977:146:HH7HHDSXY:3:1101:1687:1000 length=98 @SRR15031678.1.2 A00977:146:HH7HHDSXY:3:1101:1687:1000 length=101

GSE171999 ("no batch info" badge):

@SRR14229890.1.1 A00709:130:HHHH7DSXY:3:1101:1814:1000 length=150 @SRR14229890.1.2 A00709:130:HHHH7DSXY:3:1101:1814:1000 length=150 @SRR14229893.1.1 A00709:130:HHHH7DSXY:3:1101:1796:1000 length=150 @SRR14229893.1.2 A00709:130:HHHH7DSXY:3:1101:1796:1000 length=150 @SRR14229884.1.1 A00709:130:HHHH7DSXY:3:1101:1344:1000 length=150 @SRR14229884.1.2 A00709:130:HHHH7DSXY:3:1101:1344:1000 length=150 @SRR14229878.1.1 A00709:130:HHHH7DSXY:3:1101:1597:1000 length=150 @SRR14229878.1.2 A00709:130:HHHH7DSXY:3:1101:1597:1000 length=150 @SRR14229883.1.1 A00709:130:HHHH7DSXY:3:1101:3242:1000 length=150

The formats seems to be the same but for some reason it's not recognized for the second experiment

ppavlidis commented 3 years ago

Looking at: datasets-with-6-fields.txt

All I can say, really, is that the first fields contains a device name, and sometimes a device name and a lane (separated by an underscore). The second field sometimes contains a flowcell, sometime a run number. The 3rd seems to contains the lane number, but it is sometimes inverted with the 2nd one.

The 4th, 5th and 6th are useless and seem to contain x/y positions in the lane or a tag, or a run number.

There's just no consistent pattern, except for the device.

Great, thanks. We actually don't care what the fields mean exactly, as long as they are reasonable to form batches on. If we treat fields 1,2,3 as "usable", it should be okay, mostly. If run and lane are inverted, that could screw things up (since we go down in "resolution" by lane, flowcell, run, device, in the expectation there are fewer "bins" as you go down in resolution). But it should work out. Manual review might catch cases where it goes wrong.

ppavlidis commented 3 years ago

I just reviewed these two experiments:

GSE179370 ("no batch effect" badge):

@SRR15031682.1.1 A00977:146:HH7HHDSXY:3:1101:2754:1000 length=101 @SRR15031682.1.2 A00977:146:HH7HHDSXY:3:1101:2754:1000 length=100 @SRR15031679.1.1 A00977:146:HH7HHDSXY:3:1101:1217:1000 length=86 @SRR15031679.1.2 A00977:146:HH7HHDSXY:3:1101:1217:1000 length=86 @SRR15031685.1.1 A00977:290:H7TG7DSX2:2:1101:4598:1000 length=101 @SRR15031685.1.2 A00977:290:H7TG7DSX2:2:1101:4598:1000 length=101 @SRR15031678.1.1 A00977:146:HH7HHDSXY:3:1101:1687:1000 length=98 @SRR15031678.1.2 A00977:146:HH7HHDSXY:3:1101:1687:1000 length=101

GSE171999 ("no batch info" badge):

@SRR14229890.1.1 A00709:130:HHHH7DSXY:3:1101:1814:1000 length=150 @SRR14229890.1.2 A00709:130:HHHH7DSXY:3:1101:1814:1000 length=150 @SRR14229893.1.1 A00709:130:HHHH7DSXY:3:1101:1796:1000 length=150 @SRR14229893.1.2 A00709:130:HHHH7DSXY:3:1101:1796:1000 length=150 @SRR14229884.1.1 A00709:130:HHHH7DSXY:3:1101:1344:1000 length=150 @SRR14229884.1.2 A00709:130:HHHH7DSXY:3:1101:1344:1000 length=150 @SRR14229878.1.1 A00709:130:HHHH7DSXY:3:1101:1597:1000 length=150 @SRR14229878.1.2 A00709:130:HHHH7DSXY:3:1101:1597:1000 length=150 @SRR14229883.1.1 A00709:130:HHHH7DSXY:3:1101:3242:1000 length=150

The formats seems to be the same but for some reason it's not recognized for the second experiment

The second one should be considered single batch. I will have to investigate why it isn't showing up that way. But that's a new issue, different from this one, which is about unsupported header formats.

ppavlidis commented 3 years ago

The formats seems to be the same but for some reason it's not recognized for the second experiment

OK I see, you didn't list all the headers. GSE171999 has a singleton that was run separately, so we can't form batches. Please be sure to check that before filing an issue. This is the expected behaviour.

sanjarogic commented 3 years ago

The formats seems to be the same but for some reason it's not recognized for the second experiment

OK I see, you didn't list all the headers. GSE171999 has a singleton that was run separately, so we can't form batches. Please be sure to check that before filing an issue. This is the expected behaviour.

I missed that at first, but now I am seeing 2 samples (not a singleton) on a different lane.

sanjarogic commented 3 years ago

Also, as I mention before, it's confusing to have "no batch info" for this type of problem, it's hard for me to see what's going on.

ppavlidis commented 3 years ago

At this point we can't differentiate all possible reasons for lack of batches from the user interface. I am open to new wording for the badge. I thought I had changed the tooltip to be more explicit about the possibilities, but I guess not.

sanjarogic commented 3 years ago

@ppavlidis what's the min number of samples allowed per batch?

ppavlidis commented 3 years ago

The formats seems to be the same but for some reason it's not recognized for the second experiment

OK I see, you didn't list all the headers. GSE171999 has a singleton that was run separately, so we can't form batches. Please be sure to check that before filing an issue. This is the expected behaviour.

I missed that at first, but now I am seeing 2 samples (not a singleton) on a different lane.

Those are the + and - reads for the same sample. You can see that from the matching GSM.

ppavlidis commented 3 years ago

@ppavlidis what's the min number of samples allowed per batch?

2

sanjarogic commented 3 years ago

Got it. I'll review the other experiments that I marked as "wrong" in case I missed similar singleton issues

sanjarogic commented 3 years ago

At this point we can't differentiate all possible reasons for lack of batches from the user interface. I am open to new wording for the badge. I thought I had changed the tooltip to be more explicit about the possibilities, but I guess not.

The tooltip still says "Information on sample batching was not available"

sanjarogic commented 3 years ago

I reviewed the experiments which I marked earlier and many of them have the singleton issue although sometimes it's really hard to spot it unless I sort the right fields (eg GSE173140, GSE173137).

There are a bunch of experiments that should have been labelled as a single batch (unless I am missing something).

One experiment deem to have a format that I haven't seen before, GSE111979, but it also has a singleton, so I am not sure which one caused the "no batch info" badge.

ppavlidis commented 3 years ago

OK I'll look into that. Please start new curation issues if necessary to keep this one from getting too complicated.

ppavlidis commented 3 years ago

Just to confirm that the code already takes anything with exactly 6 fields and uses the first 3, as of 75c909ff0e34bcb80a7379beed3bd1c72d60e0b0. If there are 7 or more fields, we use the first 4. If there are exactly 5 we use the first 2. If there are exactly 3 we don't use anything, haven't seen anything valid,

GSE111979 has over 500 samples - it is single-cell RNA-seq so not currently usable anyway. It is using the format mentioned in this original report (see above), plus some have incomplete headers.

ppavlidis commented 3 years ago

Apparently those code changes have not been rolled out - the current build is from July 6. Let's pause any adding of issues here until the last release is up.

ppavlidis commented 3 years ago

OK so the problem with GSE111979 is the fields we get are

[GTTCCCGT, @NS500264, 224, HWJ37BGXY, 1, 11101, 2193, 3352, TGCGTAAGCTTAGCCATCGCATTGCTATTTCTACCTCTGAGCTGAAACCCAAACGGTTCCCGTGACTT]

The first two fields are not device etc. This seems likely to be to do with single-cell demultiplexing (UMIs?).

Regardless, it's not being handled correctly by our code , but I'm not sure I want to. Fields 3,4 and 5 are possibly usable but we've never seen this in other data sets, and single-cell data is for the future.

sanjarogic commented 3 years ago

GSE111979 has over 500 samples - it is single-cell RNA-seq so not currently usable anyway. It is using the format mentioned in this original report (see above), plus some have incomplete headers.

I didn't realize this was a single-cell experiment. It makes sense to ignore this format for now.

ppavlidis commented 3 years ago

Guys, I'm not going to be adding any more header formats for now. I think all the cases under discussion are handled by "use the first 3 fields" (and worst case, if that isn't interpretable as batches, we just fail, which is the right thing to do anyway).

Second: I've added some new bells and whistles here which should help make it easier to understand the status from the UI d3751816608c1dcc777eb92ae13614b0679ce439

There are now two new AuditEventTypes to indicate 1) failure to form batches due to singletons and 2) failure to form batches because the FASTQ headers weren't usable. This will be reflected in the badges. This is in addition to existing status to show single-batch, lack of a batch effect, or successful batch correction, as well as two other more generic failure modes.

However: this is difficult for me to test the new feature fully because I don't want to add new event types to the live database. Hibernate will throw errors when it encounters them (classes which are not declared in the application version running). But I have updated the unit tests to exercise them.

In other words: we won't be able to test this until we update production and then update the batch status for some of the experiments that are relevant. They can end up with a UninformativeFASTQHeadersForBatchingEvent or SingletonBatchInvalidEvent which we can hopefully see the effect of on the UI.

ppavlidis commented 3 years ago

There were a couple of issues with the badge display, this is fixed in the repo. But the new event types are added correctly when we redo the batch info. I'll continue to test it.

PavlidisLab / Gemma

Additional FASTQ header formats for possible inclusion. #171