FRED-2 / OptiType

Precision HLA typing from next-generation sequencing data
BSD 3-Clause "New" or "Revised" License
181 stars 74 forks source link

No alleles called because of possible alignment issue #86

Open toddcreasy opened 5 years ago

toddcreasy commented 5 years ago

Hi,

I'm trying to map and call alleles on fastq files from TCGA. However, out of the 400 samples I've run, only a very small number were called. The failed ones (as indicated from the output below) seem to have a possible read mapping problem and thus cbc can't solve. Is there something I'm missing from the log output that would indicate a problem? If so, how might I fix it?

I'm using the latest versions of razers3 and optitype.

mapping with 4 threads...

0:00:01.50 Mapping C500.TCGA-BT-A42B-10A-01D-A23K-08.3_gdc_realn_R1.fastq to GEN reference...

0:29:46.60 Mapping C500.TCGA-XF-A9T2-10A-01D-A42H-08.2_gdc_realn_R2.fastq to GEN reference...

0:44:43.01 Generating binary hit matrix. 0:44:43.09 Loading /ts18/ngs/studies/TCGA_BLCA_Mutect/optitype/2018_12_14_11_14_41_1.bam started. Number of HLA reads loaded (updated every thousand): 1K...2K...3K...4K...5K...6K...7K...8K.../Biomarker/ngs/software/OptiType/OptiType-1.3.2/hlatyper.py:239: FutureWarning: from_items is deprecated. Please use DataFrame.from_dict( dict(items), ...) instead. DataFrame.from_dict(OrderedDict(items)) may be used to preserve the key order. pos_df = pd.DataFrame.from_items(iter(hits.items())).T

0:44:47.81 8604 reads loaded. Creating dataframe... 0:44:48.28 Dataframes created. Shape: 8604 x 11179, hits: 2717490 (2717490), sparsity: 1 in 35.39 0:44:48.89 Loading /ts18/ngs/studies/TCGA_BLCA_Mutect/optitype/2018_12_14_11_14_41_2.bam started. Number of HLA reads loaded (updated every thousand): 1K...2K...3K...4K...5K...6K...7K...8K... 0:44:52.14 8222 reads loaded. Creating dataframe... 0:44:52.59 Dataframes created. Shape: 8222 x 11179, hits: 1917861 (1917861), sparsity: 1 in 47.93 0:44:53.58 Alignment pairing completed. 0 paired, 16826 unpaired, 0 discordant

WARNING: Less than 10% of reads could be paired. Consider an appropriate unpaired_weight setting in your config file (currently 0.000), because you may need to resort to using u npaired reads.

0:44:57.43 temporary pruning of identical rows and columns

0:44:57.44 Size of mtx with unique rows and columns: (0, 1) 0:44:57.44 determining minimal set of non-overshadowed alleles

0:44:57.45 Keeping only the minimal number of required alleles (1,)

0:44:57.45 Creating compact model...

starting ilp solver with 4 threads...

0:44:57.45 Initializing OptiType model... WARNING: Constant objective detected, replacing with a placeholder to prevent solver failure. Welcome to the CBC MILP Solver Version: 2.9.9 Build Date: Oct 13 2018

command line - /home/kgvx167/anaconda3/bin/cbc -threads 4 -printingOptions all -import /home/kgvx167/tmpolFvmh.pyomo.lp -stat=1 -solve -solu /home/kgvx167/tmpolFvmh.pyomo.soln ( default strategy 1) threads was changed from 0 to 4 Option for printingOptions changed from normal to all CoinLpIO::readLp(): Maximization problem reformulated as minimization Presolve 0 (-4) rows, 0 (-3) columns and 0 (-5) elements Statistics for presolved model Original problem has 1 integers (1 of which binary)

Problem has 0 rows, 0 columns (0 with objective) and 0 elements Column breakdown: 0 of type 0.0->inf, 0 of type 0.0->up, 0 of type lo->inf, 0 of type lo->up, 0 of type free, 0 of type fixed, 0 of type -inf->0.0, 0 of type -inf->up, 0 of type 0.0->1.0 Row breakdown: 0 of type E 0.0, 0 of type E 1.0, 0 of type E -1.0, 0 of type E other, 0 of type G 0.0, 0 of type G 1.0, 0 of type G other, 0 of type L 0.0, 0 of type L 1.0, 0 of type L other, 0 of type Range 0.0->1.0, 0 of type Range other, 0 of type Free Continuous objective value is 0 - 0.00 seconds Cgl0004I processed model has 0 rows, 0 columns (0 integer (0 of which binary)) and 0 elements Cbc3007W No integer variables - nothing to do Cuts at root node changed objective from 0 to -1.79769e+308 Probing was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds) Gomory was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds) Knapsack was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds) Clique was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds) MixedIntegerRounding2 was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds) FlowCover was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds) TwoMirCuts was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)

Result - Optimal solution found

Objective value: 0.00000000 Enumerated nodes: 0 Total iterations: 0 Time (CPU seconds): 0.02 Time (Wallclock seconds): 0.00

Total time (CPU seconds): 0.04 (Wallclock seconds): 0.01

/home/kgvx167/anaconda3/envs/python27/lib/python2.7/site-packages/pandas/core/series.py:851: FutureWarning: Passing list-likes to .loc or [] with any missing label will raise KeyError in the future, you can use .reindex() as an alternative.

See the documentation here: https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike return self.loc[key]

0:44:57.78 Result dataframe has been constructed...

andras86 commented 5 years ago

As I assume it's not due to massive insert length, it might have to do with read names. Can you put the first few lines of those R1 and R2 fastq files in here? I already don't like the way those filenames look, they don't just differ in R1/R2 but their prefixes are totally different. Could be the case for the read IDs inside them as well. Since OptiType processes two single-end mapped files and pairs them up later based on the read IDs, if they have different prefixes for R1 and R2 reads, it won't be able to pair them. If that's the case, you'll have to hack something in to make it work. I can help you with that after the weekend. Btw, have you actually checked if those fastq files belong together? It would be such unusual practice to give corresponding R1 and R2 files different prefixes...

toddcreasy commented 5 years ago

Oh man. I can't believe I didn't see that. Yes, I don't like the filenames look either :) There is clearly a bug on my end. Thanks for noticing.

toddcreasy commented 5 years ago

Hi again,

I've fixed the read pair issue but I'm still getting a lot of files with no reads mapping or some other issue. I've put 3 examples below where the results fail. The final .tsv file has only A01:01 A01:01 with 0 reads mapped.

==== ERROR EXAMPLE 1 ====

mapping with 4 threads...

0:00:01.07 Mapping C500.TCGA-DK-AA6M-10A-01D-A394-08.2_gdc_realn_R1.fastq to GEN reference...

0:18:58.54 Mapping C500.TCGA-DK-AA6M-10A-01D-A394-08.2_gdc_realn_R2.fastq to GEN reference...

0:34:20.73 Generating binary hit matrix. 0:34:20.84 Loading /ts18/ngs/studies/TCGA_BLCA_Mutect/optitype/2018_12_14_12_42_58_1.bam started. Number of HLA reads loaded (updated every thousand): 1K...2K...3K...4K...5K.../Biomarker/ngs/software/OptiType/OptiType-1.3.2/hlatyper.py:239: FutureWarning: from_items is deprecated. Please use DataFrame.from_dict(dict(items), .. .) instead. DataFrame.from_dict(OrderedDict(items)) may be used to preserve the key order. pos_df = pd.DataFrame.from_items(iter(hits.items())).T

0:34:25.09 5380 reads loaded. Creating dataframe... 0:34:25.62 Dataframes created. Shape: 5380 x 11179, hits: 1576631 (1576631), sparsity: 1 in 38.15 0:34:26.66 Loading /ts18/ngs/studies/TCGA_BLCA_Mutect/optitype/2018_12_14_12_42_58_2.bam started. Number of HLA reads loaded (updated every thousand): 1K...[E::bgzf_read] Read block operation failed with error 2 after 0 of 4 bytes Traceback (most recent call last): File "/Biomarker/ngs/software/OptiType/OptiType-1.3.2/OptiTypePipeline.py", line 312, in pos2, read_details2 = ht.pysam_to_hdf(bam_paths[1]) File "/Biomarker/ngs/software/OptiType/OptiType-1.3.2/hlatyper.py", line 209, in pysam_to_hdf for aln in sam: File "pysam/libcalignmentfile.pyx", line 1854, in pysam.libcalignmentfile.AlignmentFile.next IOError: truncated file

==== ERROR EXAMPLE 2 ==== mapping with 4 threads...

0:00:00.68 Mapping C500.TCGA-FD-A5BX-10A-01D-A26K-08.5_gdc_realn_R1.fastq to GEN reference...

0:19:25.13 Mapping C500.TCGA-FD-A5BX-10A-01D-A26K-08.5_gdc_realn_R2.fastq to GEN reference...

0:36:35.57 Generating binary hit matrix. 0:36:35.79 Loading /ts18/ngs/studies/TCGA_BLCA_Mutect/optitype/2018_12_14_12_42_59_1.bam started. Number of HLA reads loaded (updated every thousand): 1K...2K...3K...4K...5K.../Biomarker/ngs/software/OptiType/OptiType-1.3.2/hlatyper.py:239: FutureWarning: from_items is deprecated. Please use DataFrame.from_dict(dict(items), .. .) instead. DataFrame.from_dict(OrderedDict(items)) may be used to preserve the key order. pos_df = pd.DataFrame.from_items(iter(hits.items())).T

0:36:42.89 5909 reads loaded. Creating dataframe... 0:36:43.19 Dataframes created. Shape: 5909 x 11179, hits: 2083469 (2083469), sparsity: 1 in 31.71 Traceback (most recent call last): File "/Biomarker/ngs/software/OptiType/OptiType-1.3.2/OptiTypePipeline.py", line 312, in pos2, read_details2 = ht.pysam_to_hdf(bam_paths[1]) File "/Biomarker/ngs/software/OptiType/OptiType-1.3.2/hlatyper.py", line 186, in pysam_to_hdf sam = pysam.AlignmentFile(samfile, sam_or_bam) File "pysam/libcalignmentfile.pyx", line 734, in pysam.libcalignmentfile.AlignmentFile.cinit File "pysam/libcalignmentfile.pyx", line 983, in pysam.libcalignmentfile.AlignmentFile._open ValueError: file has no sequences defined (mode='rb') - is it SAM/BAM format? Consider opening with check_sq=False

=== ERROR EXAMPLE 3 === mapping with 4 threads...

0:00:01.07 Mapping C500.TCGA-4Z-AA7S-10A-01D-A394-08.2_gdc_realn_R1.fastq to GEN reference...

0:18:55.35 Mapping C500.TCGA-4Z-AA7S-10A-01D-A394-08.2_gdc_realn_R2.fastq to GEN reference...

0:39:02.31 Generating binary hit matrix. 0:39:02.35 Loading /ts18/ngs/studies/TCGA_BLCA_Mutect/optitype/2018_12_14_12_42_54_1.bam started. Number of HLA reads loaded (updated every thousand): 1K...2K...3K...4K.../Biomarker/ngs/software/OptiType/OptiType-1.3.2/hlatyper.py:239: FutureWarning: from_items is deprecated. Please use DataFrame.from_dict(dict(items), ...) in stead. DataFrame.from_dict(OrderedDict(items)) may be used to preserve the key order. pos_df = pd.DataFrame.from_items(iter(hits.items())).T

0:39:03.89 4359 reads loaded. Creating dataframe... 0:39:04.06 Dataframes created. Shape: 4359 x 11179, hits: 1346154 (1346157), sparsity: 1 in 36.20 0:39:05.59 Loading /ts18/ngs/studies/TCGA_BLCA_Mutect/optitype/2018_12_14_12_42_54_2.bam started. Number of HLA reads loaded (updated every thousand): 1K...2K... 0:39:06.26 2602 reads loaded. Creating dataframe... 0:39:06.36 Dataframes created. Shape: 2602 x 11179, hits: 573528 (573528), sparsity: 1 in 50.72 0:39:06.79 Alignment pairing completed. 0 paired, 6961 unpaired, 0 discordant

WARNING: Less than 10% of reads could be paired. Consider an appropriate unpaired_weight setting in your config file (currently 0.000), because you may need to resort to using u npaired reads.

0:39:09.49 temporary pruning of identical rows and columns

0:39:09.51 Size of mtx with unique rows and columns: (0, 1) 0:39:09.51 determining minimal set of non-overshadowed alleles

0:39:09.51 Keeping only the minimal number of required alleles (1,)

0:39:09.51 Creating compact model...

starting ilp solver with 4 threads...

0:39:09.51 Initializing OptiType model... WARNING: Constant objective detected, replacing with a placeholder to prevent solver failure. Welcome to the CBC MILP Solver Version: 2.9.9 Build Date: Oct 13 2018

command line - /home/kgvx167/anaconda3/bin/cbc -threads 4 -printingOptions all -import /home/kgvx167/tmpi3v0gI.pyomo.lp -stat=1 -solve -solu /home/kgvx167/tmpi3v0gI.pyomo.soln ( default strategy 1) threads was changed from 0 to 4 Option for printingOptions changed from normal to all CoinLpIO::readLp(): Maximization problem reformulated as minimization Presolve 0 (-4) rows, 0 (-3) columns and 0 (-5) elements Statistics for presolved model Original problem has 1 integers (1 of which binary)

Problem has 0 rows, 0 columns (0 with objective) and 0 elements Column breakdown: 0 of type 0.0->inf, 0 of type 0.0->up, 0 of type lo->inf, 0 of type lo->up, 0 of type free, 0 of type fixed, 0 of type -inf->0.0, 0 of type -inf->up, 0 of type 0.0->1.0 Row breakdown: 0 of type E 0.0, 0 of type E 1.0, 0 of type E -1.0, 0 of type E other, 0 of type G 0.0, 0 of type G 1.0, 0 of type G other, 0 of type L 0.0, 0 of type L 1.0, 0 of type L other, 0 of type Range 0.0->1.0, 0 of type Range other, 0 of type Free Continuous objective value is 0 - 0.01 seconds Cgl0004I processed model has 0 rows, 0 columns (0 integer (0 of which binary)) and 0 elements Cbc3007W No integer variables - nothing to do Cuts at root node changed objective from 0 to -1.79769e+308 Probing was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds) Gomory was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds) Knapsack was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds) Clique was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds) MixedIntegerRounding2 was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds) FlowCover was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds) TwoMirCuts was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)

Result - Optimal solution found

Objective value: 0.00000000 Enumerated nodes: 0 Total iterations: 0 Time (CPU seconds): 0.04 Time (Wallclock seconds): 0.00

Total time (CPU seconds): 0.09 (Wallclock seconds): 0.01

/home/kgvx167/anaconda3/envs/python27/lib/python2.7/site-packages/pandas/core/series.py:851: FutureWarning: Passing list-likes to .loc or [] with any missing label will raise KeyError in the future, you can use .reindex() as an alternative.

See the documentation here: https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike return self.loc[key]

0:39:09.85 Result dataframe has been constructed...

toddcreasy commented 5 years ago

Here is another error (error example 4):

mapping with 1 threads...

0:00:00.88 Mapping C500.TCGA-CF-A47T-10A-01D-A23U-08.3_gdc_realn_R1.fastq to GEN reference...

0:17:04.69 Mapping C500.TCGA-CF-A47T-10A-01D-A23U-08.3_gdc_realn_R2.fastq to GEN reference...

0:32:02.68 Generating binary hit matrix. Traceback (most recent call last): File "/Biomarker/ngs/software/OptiType/OptiType-1.3.2/OptiTypePipeline.py", line 309, in pos, read_details = ht.pysam_to_hdf(bam_paths[0]) File "/Biomarker/ngs/software/OptiType/OptiType-1.3.2/hlatyper.py", line 186, in pysam_to_hdf sam = pysam.AlignmentFile(samfile, sam_or_bam) File "pysam/libcalignmentfile.pyx", line 734, in pysam.libcalignmentfile.AlignmentFile.cinit File "pysam/libcalignmentfile.pyx", line 944, in pysam.libcalignmentfile.AlignmentFile._open File "pysam/libchtslib.pyx", line 366, in pysam.libchtslib.HTSFile.check_truncation IOError: no BGZF EOF marker; file may be truncated

BioLaoXu commented 4 years ago

I have account the same issue like this:

Traceback (most recent call last): File "/Analysis/bioinfo/xclv/HLA_typing/OptiType/OptiTypePipeline.py", line 312, in pos2, read_details2 = ht.pysam_to_hdf(bam_paths[1]) File "/Analysis/bioinfo/xclv/HLA_typing/OptiType/hlatyper.py", line 186, in pysam_to_hdf sam = pysam.AlignmentFile(samfile, sam_or_bam) File "pysam/libcalignmentfile.pyx", line 736, in pysam.libcalignmentfile.AlignmentFile.cinit File "pysam/libcalignmentfile.pyx", line 985, in pysam.libcalignmentfile.AlignmentFile._open ValueError: file has no sequences defined (mode='rb') - is it SAM/BAM format? Consider opening with check_sq=False

At first I thought there was something wrong with my fastq data, but I verified that my data was ok.

my datas are so large(>61M,10X RNA),so i wonder if it is because my data is too large,and i parallel run OptiType at my datas, so i change my way to run OptiType sample by sample,and this error strangly disappear.