davidpng / FCS_Database

Program to scrape an FCS directory of metadata
GNU General Public License v3.0

Antigen/Fluorophore Parsing Issues #16

Open davidpng opened 9 years ago

davidpng commented 9 years ago

It seems that our parsing method doesn't do a good job of splitting antigens and fluorophores. Things seem to fall apart with a long tail of research and custom samples (Kappa and Lambda look odd; I don't know what is up with that).

select fluorophore,count(*) as count from pmttubecases group by fluorophore order by count desc limit 50;

Fluorophore count
FSC-A 194352
SSC 194352
FSC 194350
SSC-A 194350
Time 194350
A594 194338
PE-Cy7 194252
APC 193965
PE 184081
FITC 169333
PB 139944
A700 116112
PE-TR 110000
APC-H7 102333
APC-Cy7 91988
PE-Cy5 85513
Red-H 83984
APC-A700 73705
PE-Cy55 59550
V450 40729
PerCP-Cy55 34607
A488 14742
PCP-Cy55 12462
Kappa FITC 9191
Lambda PE 9191
Blue-H 7693
BV 5840
Draq5 4509
PerCP-Cy5-5 1929
A PE 967
16 FITC 948
+ 16 A647 APC 255
PE-Texas Red 211
PC5 154
ECD 118
BV421 71
cCD3APC 59
PerCP-Cy5 58
PC7 50
APC 29
PE 29
FITC 25
DAPI 17
DR PB 17
A647 APC 14
DRAQ5 14
TdT FITC 14
delta PE 13
beta FITC 12
CD3-PE-Cy7 10
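
One possible way to make the split more robust, sketched here under assumptions: match the longest known fluorophore name at the end of the label instead of splitting on whitespace or hyphens. The whitelist `KNOWN_FLUOROPHORES` and the function `split_antigen_fluorophore` are hypothetical names, not part of the existing parser, and the whitelist is only a sample drawn from the counts above.

```python
# Hypothetical whitelist built from names that appear in the query output above;
# the real list would need to come from the lab's panel definitions.
KNOWN_FLUOROPHORES = [
    "PE-Texas Red", "PerCP-Cy5-5", "PerCP-Cy55", "APC-A700", "APC-Cy7",
    "APC-H7", "PE-Cy55", "PE-Cy7", "PE-Cy5", "PE-TR", "FITC", "A700",
    "A594", "A488", "V450", "BV421", "APC", "ECD", "PB", "PE",
]


def split_antigen_fluorophore(label):
    """Split a channel label into (antigen, fluorophore); either may be None."""
    text = label.strip()
    # Try the longest names first so "PE-Cy7" is matched before "PE".
    for fluor in sorted(KNOWN_FLUOROPHORES, key=len, reverse=True):
        if text.lower().endswith(fluor.lower()):
            antigen = text[: len(text) - len(fluor)].strip(" -_")
            return (antigen or None, fluor)
    # No known fluorophore suffix: treat the whole label as the antigen.
    return (text or None, None)


for label in ["CD3-PE-Cy7", "Kappa FITC", "PE-Texas Red", "cCD3APC"]:
    print(label, "->", split_antigen_fluorophore(label))
```

Suffix matching keeps multi-word names such as PE-Texas Red intact, but it would misfire on any antigen whose own name happens to end in a fluorophore name, so the whitelist would need review against real panels.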
davidpng commented 9 years ago

It looks like a lot of the antigen-fluorophore parsings did not work correctly, especially for PE-Texas Red:

select TubeTypesInstances.tube_type, antigens, MIN(date) as min_date, MAX(date) as max_date, COUNT(*) as count from TubeCases INNER JOIN TubeTypesInstances USING (tube_type_instance) group by TubeTypesInstances.tube_type_instance order by count desc limit 40;

tube_type Antigens min_date max_date count
Myeloid 1 CD117;CD13;CD15;CD19;CD33;CD34;CD38;CD45;CD71;LA-DR;Unknown 2005-11-17 14:43:26 2013-01-02 09:00:36 26862
B Cells New CD10;CD19;CD20;CD38;CD45;CD5;Kappa;Lambda;PE-Texas;Unknown 2008-10-23 17:08:13 2013-01-02 11:19:21 25579
Myeloid 2 CD123;CD13;CD14;CD16;CD34;CD38;CD4;CD45;CD64;LA-DR;Unknown 2006-07-15 11:15:32 2013-01-02 09:00:37 24020
T Cells New CD2;CD3;CD30;CD34;CD4;CD45;CD5;CD56;CD7;CD8;Unknown 2007-11-28 16:46:33 2013-01-02 11:19:21 21070
B cells rpt CD10;CD19;CD20;CD38;CD45;CD5;Kappa;Lambda;Unknown 2006-02-11 14:40:32 2009-08-20 09:22:47 18143
Myeloid 4 CD33;CD34;CD38;CD45;CD5;CD56;CD7;PE-Texas;Unknown 2006-06-14 12:43:37 2011-07-15 13:31:13 11017
T5 CD2;CD3;CD34;CD4;CD45;CD5;CD56;CD7;CD8;Unknown 2005-11-17 14:27:54 2009-01-24 16:39:45 10711
Plasma Cell NEW CD138;CD19;CD38;CD45;CD56;DAPI;PE-Texas;Unknown;cyto 2008-12-12 13:12:25 2012-12-31 15:37:12 5157
NEWa CD10;CD19;CD20;CD38;CD45;CD58;PE-Texas;Unknown 2006-04-19 17:33:19 2011-12-31 16:09:56 4239
COG B CD10;CD13+33;CD19;CD34;CD45;CD9;PE-Texas;Unknown 2006-12-12 15:25:16 2012-07-06 13:39:38 4232
Plasma Cells NEW CD19;CD38;CD45;CD56;DAPI;PE-Texas;Unknown;cyto 2005-11-18 14:32:51 2009-03-03 18:48:58 3963
WBC CD34;CD45;CD71;PE-Texas;Unknown 2005-11-18 11:06:42 2008-02-15 11:33:17 3643
Myeloid 4 CD33;CD34;CD38;CD45;CD5;CD56;CD7;PE-Texas;Pacific;Unknown 2011-07-13 15:18:31 2012-12-31 16:51:35 3047
B-ALL CD10;CD19;CD20;CD34;CD38;CD45;CD58;PE-Texas;Unknown 2008-11-20 13:15:52 2012-07-06 13:40:41 2905
D CD19;CD3;CD45;CD71;PE-Texas;SYTO16;Unknown 2006-12-12 15:28:46 2012-07-06 13:40:10 2698
B ALL MRD CD10;CD19;CD20;CD34;CD38;CD45;CD58;PE-Texas;Pacific;Unknown 2011-07-15 12:41:02 2013-01-02 08:54:03 2119
B ALL MRD CD10;CD19;CD20;CD34;CD38;CD45;CD58;Unknown 2005-11-19 12:59:18 2008-11-20 17:54:51 1927
T4 CD16;CD3;CD38;CD45;CD5;CD56;CD7;PE-Texas;Unknown;cCD3 2007-12-17 14:40:18 2012-12-28 17:49:33 1743
PNH CD14;CD33;CD45;CD66b;PE-Texas;Unknown 2005-11-18 15:28:36 2010-07-31 12:38:50 1337
WBC CD34;CD45;CD71;DRAQ5;PE-Texas;Unknown 2005-11-17 14:42:56 2010-05-20 16:10:43 1245
addon CD10;CD19;CD20;CD38;CD40;CD45;CD5;Kappa;Lambda;Unknown 2005-11-17 14:27:31 2010-02-18 11:46:35 1216
Myeloid 2 CD123;CD13;CD14;CD16;CD34;CD38;CD45;CD64;LA-DR;Unknown 2006-02-09 13:38:49 2008-03-15 12:49:19 1149
D8 CD10;CD19;CD20;CD34;CD45;PE-Texas;SYTO16;Unknown 2006-12-12 16:30:13 2011-12-24 14:06:28 1110
Other B cell CD103;CD11c;CD19;CD25;CD45;PE-Texas;Unknown 2005-11-17 14:58:40 2012-12-23 13:03:02 1007
Bone Marrow WBC CD16;CD33;CD38;CD45;CD71;PE-Texas;Unknown 2010-09-08 14:32:27 2012-01-17 13:25:08 852
T3 CD3;CD45;CD56;CD7;CD71;PE-Texas;SYTO16;Unknown 2007-01-26 19:21:53 2011-12-31 16:11:21 847
Hodgkin CD15;CD20;CD30;CD40;CD45;CD5;CD64;CD71;CD95;Unknown 2006-10-07 14:10:12 2013-01-02 14:46:51 665
NEW PNH WBC CD14;CD15;CD24;CD45;CD64;FLAER;PE-Texas;Unknown 2010-07-29 13:14:44 2012-12-30 11:26:40 650
BAL COUNT 7AAD;GLY;PE-Texas;Pacific;SYTO;Unknown 2011-10-28 17:03:33 2012-12-31 16:49:15 625
CLL TUBE CD19;CD200;CD23;CD5;FMC7;Pacific;Unknown 2011-07-13 17:13:52 2012-12-28 11:54:21 542
Myeloid 2 CD123;CD13;CD14;CD16;CD34;CD36;CD38;CD45;CD64;LA-DR;Unknown 2005-11-17 14:43:46 2006-02-15 11:16:04 475
new TdT CD10;CD38;CD45;CD7;PE-Texas;TdT;Unknown;cCD3 2007-06-19 15:08:15 2008-06-06 13:47:51 444
NEW PNH RBC CD59;GlyA;PE-Texas;Pacific;Unknown 2011-07-14 11:10:46 2012-12-30 11:26:38 343
New MASTO CD117;CD2;CD25;CD45;PE-Texas;Pacific;Unknown 2012-01-17 15:41:09 2012-12-28 09:07:40 343
Neg MASTO CD117;CD45;PE-Texas;Pacific;Unknown 2012-01-17 15:41:46 2012-12-28 09:07:41 340
Bcl-2 addon Bcl2;CD10;CD19;CD20;CD38;CD45;CD5;PE-Texas;Unknown 2006-06-26 16:27:24 2011-07-16 12:02:12 317
BAL COUNT 7AAD;GLY;PE-Texas;SYTO;Unknown 2010-02-02 16:50:06 2011-07-11 14:30:13 316
T5 New CD16+56;CD2;CD3;CD34;CD4;CD45;CD5;CD7;CD8;Unknown 2009-01-31 14:50:29 2012-05-09 16:31:13 313
CLL TUBE CD19;CD200;CD23;CD5;FMC7;Unknown 2010-12-28 15:40:16 2011-07-15 10:30:45 291
NEW PNH RBC CD59;GlyA;PE-Texas;Unknown 2010-05-06 16:42:24 2011-07-14 14:54:47 284
hermands commented 9 years ago

Okay. Let's decide what to clean up and what to document and punt till later.

-Dan

On Nov 15, 2014, at 8:28 AM, David Ng notifications@github.com wrote:

It looks like a lot of the antigen-fluorophore parsings did not work correctly esp PE-Texas Red:

select TubeTypesInstances.*, MIN(date) as min_date, MAX(date) as max_date, COUNT(*) as count from TubeCases INNER JOIN TubeTypesInstances USING (tube_type_instance) group by TubeTypesInstances.tube_type_instance order by count desc limit 40;

tube_type_instance tube_type Antigens min_date max_date count
1 Myeloid 1 CD117;CD13;CD15;CD19;CD33;CD34;CD38;CD45;CD71;LA-DR;Unknown 2005-11-17 14:43:26 2013-01-02 09:00:36 26862
7 B Cells New CD10;CD19;CD20;CD38;CD45;CD5;Kappa;Lambda;PE-Texas;Unknown 2008-10-23 17:08:13 2013-01-02 11:19:21 25579
3 Myeloid 2 CD123;CD13;CD14;CD16;CD34;CD38;CD4;CD45;CD64;LA-DR;Unknown 2006-07-15 11:15:32 2013-01-02 09:00:37 24020
5 T Cells New CD2;CD3;CD30;CD34;CD4;CD45;CD5;CD56;CD7;CD8;Unknown 2007-11-28 16:46:33 2013-01-02 11:19:21 21070
964 B cells rpt CD10;CD19;CD20;CD38;CD45;CD5;Kappa;Lambda;Unknown 2006-02-11 14:40:32 2009-08-20 09:22:47 18143
363 Myeloid 4 CD33;CD34;CD38;CD45;CD5;CD56;CD7;PE-Texas;Unknown 2006-06-14 12:43:37 2011-07-15 13:31:13 11017
949 T5 CD2;CD3;CD34;CD4;CD45;CD5;CD56;CD7;CD8;Unknown 2005-11-17 14:27:54 2009-01-24 16:39:45 10711
9 Plasma Cell NEW CD138;CD19;CD38;CD45;CD56;DAPI;PE-Texas;Unknown;cyto 2008-12-12 13:12:25 2012-12-31 15:37:12 5157
867 NEWa CD10;CD19;CD20;CD38;CD45;CD58;PE-Texas;Unknown 2006-04-19 17:33:19 2011-12-31 16:09:56 4239
111 COG B CD10;CD13+33;CD19;CD34;CD45;CD9;PE-Texas;Unknown 2006-12-12 15:25:16 2012-07-06 13:39:38 4232
1014 Plasma Cells NEW CD19;CD38;CD45;CD56;DAPI;PE-Texas;Unknown;cyto 2005-11-18 14:32:51 2009-03-03 18:48:58 3963
1015 WBC CD34;CD45;CD71;PE-Texas;Unknown 2005-11-18 11:06:42 2008-02-15 11:33:17 3643
4 Myeloid 4 CD33;CD34;CD38;CD45;CD5;CD56;CD7;PE-Texas;Pacific;Unknown 2011-07-13 15:18:31 2012-12-31 16:51:35 3047
317 B-ALL CD10;CD19;CD20;CD34;CD38;CD45;CD58;PE-Texas;Unknown 2008-11-20 13:15:52 2012-07-06 13:40:41 2905
316 D CD19;CD3;CD45;CD71;PE-Texas;SYTO16;Unknown 2006-12-12 15:28:46 2012-07-06 13:40:10 2698
14 B ALL MRD CD10;CD19;CD20;CD34;CD38;CD45;CD58;PE-Texas;Pacific;Unknown 2011-07-15 12:41:02 2013-01-02 08:54:03 2119
1017 B ALL MRD CD10;CD19;CD20;CD34;CD38;CD45;CD58;Unknown 2005-11-19 12:59:18 2008-11-20 17:54:51 1927
19 T4 CD16;CD3;CD38;CD45;CD5;CD56;CD7;PE-Texas;Unknown;cCD3 2007-12-17 14:40:18 2012-12-28 17:49:33 1743
630 PNH CD14;CD33;CD45;CD66b;PE-Texas;Unknown 2005-11-18 15:28:36 2010-07-31 12:38:50 1337
760 WBC CD34;CD45;CD71;DRAQ5;PE-Texas;Unknown 2005-11-17 14:42:56 2010-05-20 16:10:43 1245
668 addon CD10;CD19;CD20;CD38;CD40;CD45;CD5;Kappa;Lambda;Unknown 2005-11-17 14:27:31 2010-02-18 11:46:35 1216
1088 Myeloid 2 CD123;CD13;CD14;CD16;CD34;CD38;CD45;CD64;LA-DR;Unknown 2006-02-09 13:38:49 2008-03-15 12:49:19 1149
924 D8 CD10;CD19;CD20;CD34;CD45;PE-Texas;SYTO16;Unknown 2006-12-12 16:30:13 2011-12-24 14:06:28 1110
13 Other B cell CD103;CD11c;CD19;CD25;CD45;PE-Texas;Unknown 2005-11-17 14:58:40 2012-12-23 13:03:02 1007
2 Bone Marrow WBC CD16;CD33;CD38;CD45;CD71;PE-Texas;Unknown 2010-09-08 14:32:27 2012-01-17 13:25:08 852
448 T3 CD3;CD45;CD56;CD7;CD71;PE-Texas;SYTO16;Unknown 2007-01-26 19:21:53 2011-12-31 16:11:21 847
24 Hodgkin CD15;CD20;CD30;CD40;CD45;CD5;CD64;CD71;CD95;Unknown 2006-10-07 14:10:12 2013-01-02 14:46:51 665
11 NEW PNH WBC CD14;CD15;CD24;CD45;CD64;FLAER;PE-Texas;Unknown 2010-07-29 13:14:44 2012-12-30 11:26:40 650
8 BAL COUNT 7AAD;GLY;PE-Texas;Pacific;SYTO;Unknown 2011-10-28 17:03:33 2012-12-31 16:49:15 625
15 CLL TUBE CD19;CD200;CD23;CD5;FMC7;Pacific;Unknown 2011-07-13 17:13:52 2012-12-28 11:54:21 542
1012 Myeloid 2 CD123;CD13;CD14;CD16;CD34;CD36;CD38;CD45;CD64;LA-DR;Unknown 2005-11-17 14:43:46 2006-02-15 11:16:04 475
947 new TdT CD10;CD38;CD45;CD7;PE-Texas;TdT;Unknown;cCD3 2007-06-19 15:08:15 2008-06-06 13:47:51 444
10 NEW PNH RBC CD59;GlyA;PE-Texas;Pacific;Unknown 2011-07-14 11:10:46 2012-12-30 11:26:38 343
50 New MASTO CD117;CD2;CD25;CD45;PE-Texas;Pacific;Unknown 2012-01-17 15:41:09 2012-12-28 09:07:40 343
51 Neg MASTO CD117;CD45;PE-Texas;Pacific;Unknown 2012-01-17 15:41:46 2012-12-28 09:07:41 340
373 Bcl-2 addon Bcl2;CD10;CD19;CD20;CD38;CD45;CD5;PE-Texas;Unknown 2006-06-26 16:27:24 2011-07-16 12:02:12 317
372 BAL COUNT 7AAD;GLY;PE-Texas;SYTO;Unknown 2010-02-02 16:50:06 2011-07-11 14:30:13 316
86 T5 New CD16+56;CD2;CD3;CD34;CD4;CD45;CD5;CD7;CD8;Unknown 2009-01-31 14:50:29 2012-05-09 16:31:13 313
360 CLL TUBE CD19;CD200;CD23;CD5;FMC7;Unknown 2010-12-28 15:40:16 2011-07-15 10:30:45 291
368 NEW PNH RBC CD59;GlyA;PE-Texas;Unknown 2010-05-06 16:42:24 2011-07-14 14:54:47 284

davidpng commented 9 years ago

I'm not sure what you mean by what to clean up and what to document.

hermands commented 9 years ago

Input:

Database

Look at data after 2006.

Antigens to fix: select tube_type, Antigen, COUNT(*) as count from PmtTubeCases INNER JOIN TubeCases USING (case_tube) INNER JOIN TubeTypesInstances USING (tube_type_instance) WHERE tube_type LIKE 'Myeloid%' GROUP BY tube_type, Antigen ORDER BY count desc limit 50;

Fluorophores to fix: select tube_type, fluorophore, COUNT(*) as count from PmtTubeCases INNER JOIN TubeCases USING (case_tube) INNER JOIN TubeTypesInstances USING (tube_type_instance) WHERE tube_type LIKE 'Myeloid%' GROUP BY tube_type, fluorophore ORDER BY count desc limit 50;
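
To combine the fluorophore query with the "after 2006" restriction, something like the following could be run from Python. This is only a sketch: the table and column names are inferred from the queries in this thread, the cutoff `2007-01-01` is one reading of "after 2006", the function name `fluorophore_counts` is hypothetical, and the in-memory rows are toy data rather than real cases.

```python
import sqlite3


def fluorophore_counts(conn, tube_pattern="Myeloid%", since="2007-01-01", limit=50):
    """Count fluorophore labels per tube type, restricted by tube name and date."""
    sql = """
        SELECT tube_type, fluorophore, COUNT(*) AS count
        FROM PmtTubeCases
        INNER JOIN TubeCases USING (case_tube)
        INNER JOIN TubeTypesInstances USING (tube_type_instance)
        WHERE tube_type LIKE ? AND date >= ?
        GROUP BY tube_type, fluorophore
        ORDER BY count DESC
        LIMIT ?
    """
    # Dates are assumed to be stored as ISO text, so string comparison works.
    return conn.execute(sql, (tube_pattern, since, limit)).fetchall()


# Toy schema and rows, made up for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TubeTypesInstances (tube_type_instance INTEGER PRIMARY KEY, tube_type TEXT);
    CREATE TABLE TubeCases (case_tube INTEGER PRIMARY KEY, tube_type_instance INTEGER, date TEXT);
    CREATE TABLE PmtTubeCases (case_tube INTEGER, antigen TEXT, fluorophore TEXT);
    INSERT INTO TubeTypesInstances VALUES (1, 'Myeloid 1'), (2, 'B Cells New');
    INSERT INTO TubeCases VALUES
        (10, 1, '2005-11-17 14:43:26'),
        (11, 1, '2008-03-01 09:00:00'),
        (12, 2, '2009-01-01 10:00:00');
    INSERT INTO PmtTubeCases VALUES
        (10, 'CD13', 'PE'), (11, 'CD13', 'PE'),
        (11, 'CD45', 'APC-H7'), (12, 'Kappa', 'FITC');
""")
rows = fluorophore_counts(conn)
for row in rows:
    print(row)
```

Passing the pattern and cutoff as parameters keeps the same query reusable for the antigen check or for other tube types.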

hermands commented 9 years ago

Addressed, but we need to check this again in the new data.

hermands commented 9 years ago

Lots of fluorophores are listed in the tube_type Antigen concatenation, which suggests that parsing is not working for many antigen/fluorophore pairs.

We need to make sure it is working for the Myeloid tubes. If it is, we can handle this later.
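
A quick way to spot this leakage, sketched under assumptions: flag any token in the semicolon-separated antigen string that is a prefix of a known fluorophore name. `FLUOROPHORE_NAMES` is a hypothetical subset, "Pacific Blue" is assumed to be the full name behind the truncated "Pacific" entries, and `leaked_fluorophores` is not an existing function in the repository.

```python
# Hypothetical subset of full fluorophore names; extend as needed.
FLUOROPHORE_NAMES = ["PE-Texas Red", "Pacific Blue", "FITC", "APC", "PE"]


def leaked_fluorophores(antigens):
    """Return tokens from an 'A;B;C' antigen string that look like fluorophore fragments."""
    leaked = []
    for token in antigens.split(";"):
        low = token.strip().lower()
        # Skip empty and single-character tokens to avoid trivial prefix matches.
        if len(low) < 2:
            continue
        if any(name.lower().startswith(low) for name in FLUOROPHORE_NAMES):
            leaked.append(token.strip())
    return leaked


print(leaked_fluorophores("CD10;CD19;CD20;CD34;CD38;CD45;CD58;PE-Texas;Pacific;Unknown"))
print(leaked_fluorophores("CD117;CD13;CD15;CD19;CD33;CD34;CD38;CD45;CD71;LA-DR;Unknown"))
```

Running this over the Antigens column for the Myeloid tubes would show whether any of them are affected before deciding to defer the rest.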

hermands commented 9 years ago

Are these appropriate?

Antigen issues:
A|684
Neg|81
neg|74
NEG|895

Fluorophore:
A PE|1758

davidpng commented 9 years ago

Neg means negative control: no antibody and no fluorophore. I can add some code to make them all lower case. I think they should be kept as an "antigen", as these tubes are typically run paired with a tube where you are trying to measure any small level of expression.

"A" and "A PE" sound like a parsing issue; can you tell me which tube they are coming from, or which other antigens are in that tube?
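
The case normalization for negative controls could look roughly like this. It is a sketch, not the existing code: `normalize_antigen` is a hypothetical helper, and it assumes that only the "neg" variants should be lower-cased while real antigen names such as CD45 keep their case. The counts are the ones reported above.

```python
from collections import Counter


def normalize_antigen(name):
    """Collapse 'Neg', 'NEG' and 'neg' into one spelling; leave other names untouched."""
    cleaned = name.strip()
    if cleaned.lower() == "neg":
        return "neg"
    return cleaned


# Antigen counts as reported in this thread.
reported = {"A": 684, "Neg": 81, "neg": 74, "NEG": 895}

merged = Counter()
for name, count in reported.items():
    merged[normalize_antigen(name)] += count

print(dict(merged))
```

After merging, the three spellings count as a single negative-control "antigen", and "A" is left alone so that it still shows up as a parsing problem to track down.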
