PoisonAlien / maftools

Summarize, Analyze and Visualize MAF files from TCGA or in-house studies.
http://bioconductor.org/packages/release/bioc/html/maftools.html
MIT License

bad fill option from read.maf() #144

Closed ShixiangWang closed 6 years ago

ShixiangWang commented 6 years ago

When I use the read.maf() function to read a MAF file from TCGA, an error occurs:

reading maf..
Error in validateMaf(maf = maf, isTCGA = isTCGA, rdup = removeDuplicatedVariants,  : 
  missing required fields from MAF: Hugo_Symbol

So I debugged the function and found that the fill option in the fread() call is what breaks reading this MAF file:

> maf2 <- fread(maf, 
+               sep = "\t", stringsAsFactors = FALSE, verbose = FALSE, 
+               data.table = TRUE, showProgress = TRUE, header = TRUE, fill = TRUE)
> dim(maf2)
[1] 208185      1
> maf2 <- fread(maf, 
+               sep = "\t", stringsAsFactors = FALSE, verbose = FALSE, 
+               data.table = TRUE, showProgress = TRUE, header = TRUE, fill = FALSE)
|--------------------------------------------------|
|==================================================|
> dim(maf2)
[1] 208180    120

I don't know how fill changes the resulting object, but it does cause this error.

PoisonAlien commented 6 years ago

Hi,

Can you post your sessionInfo()?

ShixiangWang commented 6 years ago

@PoisonAlien

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936 
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936   
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
[4] LC_NUMERIC=C                                                   
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bindrcpp_0.2.2      maftools_1.6.0      Biobase_2.40.0      BiocGenerics_0.26.0 data.table_1.11.0  
 [6] forcats_0.3.0       stringr_1.3.0       dplyr_0.7.4         purrr_0.2.4         readr_1.1.1        
[11] tidyr_0.8.0         tibble_1.4.2        ggplot2_2.2.1       tidyverse_1.2.1    

loaded via a namespace (and not attached):
  [1] colorspace_1.3-2            rjson_0.2.15                mclust_5.4                 
  [4] circlize_0.4.3              XVector_0.20.0              GenomicRanges_1.32.0       
  [7] GlobalOptions_0.0.13        rstudioapi_0.7              ggrepel_0.7.0              
 [10] bit64_0.9-7                 AnnotationDbi_1.42.0        lubridate_1.7.4            
 [13] xml2_1.2.0                  codetools_0.2-15            splines_3.5.0              
 [16] mnormt_1.5-5                doParallel_1.0.11           jsonlite_1.5               
 [19] Rsamtools_1.32.0            broom_0.4.4                 gridBase_0.4-7             
 [22] cluster_2.0.7-1             compiler_3.5.0              httr_1.3.1                 
 [25] assertthat_0.2.0            Matrix_1.2-14               lazyeval_0.2.1             
 [28] cli_1.0.0                   prettyunits_1.0.2           tools_3.5.0                
 [31] gtable_0.2.0                glue_1.2.0                  GenomeInfoDbData_1.1.0     
 [34] reshape2_1.4.3              Rcpp_0.12.16                slam_0.1-43                
 [37] cellranger_1.1.0            NMF_0.21.0                  Biostrings_2.48.0          
 [40] nlme_3.1-137                rtracklayer_1.40.0          iterators_1.0.9            
 [43] changepoint_2.2.2           psych_1.8.3.3               rvest_0.3.2                
 [46] devtools_1.13.5             rngtools_1.2.4              XML_3.98-1.11              
 [49] zlibbioc_1.26.0             zoo_1.8-1                   scales_0.5.0               
 [52] BSgenome_1.48.0             VariantAnnotation_1.26.0    hms_0.4.2                  
 [55] SummarizedExperiment_1.10.0 RColorBrewer_1.1-2          ComplexHeatmap_1.18.0      
 [58] yaml_2.1.19                 memoise_1.1.0               gridExtra_2.3              
 [61] pkgmaker_0.22               biomaRt_2.36.0              stringi_1.1.7              
 [64] RSQLite_2.1.0               S4Vectors_0.18.0            foreach_1.4.4              
 [67] GenomicFeatures_1.32.0      BiocParallel_1.13.1         shape_1.4.4                
 [70] GenomeInfoDb_1.16.0         rlang_0.2.0                 pkgconfig_2.0.1            
 [73] matrixStats_0.53.1          bitops_1.0-6                lattice_0.20-35            
 [76] bindr_0.1.1                 GenomicAlignments_1.16.0    tidyselect_0.2.4           
 [79] cowplot_0.9.2               bit_1.1-12                  plyr_1.8.4                 
 [82] magrittr_1.5                R6_2.2.2                    IRanges_2.14.0             
 [85] DelayedArray_0.6.0          DBI_1.0.0                   withr_2.1.2                
 [88] pillar_1.2.2                haven_1.1.1                 foreign_0.8-70             
 [91] survival_2.42-3             RCurl_1.95-4.10             modelr_0.1.1               
 [94] crayon_1.3.4                wordcloud_2.5               GetoptLong_0.1.6           
 [97] progress_1.1.2              grid_3.5.0                  readxl_1.1.0               
[100] blob_1.1.1                  digest_0.6.15               xtable_1.8-2               
[103] stats4_3.5.0                munsell_0.4.3               registry_0.5

PoisonAlien commented 6 years ago

Does your file have header lines starting with #? Can you share your MAF file? It's hard for me to suggest anything without a reproducible example.

ShixiangWang commented 6 years ago

Yes. The data was downloaded from TCGA and is very big, so I have put the first 200 rows here; maybe you can test with it.

toy.txt

PoisonAlien commented 6 years ago

Hi, I just checked your file. The 5th line of your MAF file is incompatible with data.table's fread function. You can remove that particular line, or all the header lines starting with #, and it should work fine. Let me know if this helps.
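
As a rough sketch of the second option (dropping every header line that starts with #), something like the following could be run from a shell. The file names and the toy contents are made up for illustration and are not taken from the attached file; substitute your own MAF path.

```shell
# Hypothetical toy file standing in for a TCGA MAF (illustrative contents only).
workdir=$(mktemp -d)
maf="$workdir/toy.maf"
clean="$workdir/toy.clean.maf"

# Two '#' header lines, then the column header, then two data rows.
printf '#version 2.4\n#some annotation line\nHugo_Symbol\tChromosome\nTP53\t17\nKRAS\t12\n' > "$maf"

# Keep only the lines that do NOT begin with '#'.
grep -v '^#' "$maf" > "$clean"

# The first remaining line should now be the column header.
head -n 1 "$clean"
```

The cleaned file can then be passed to read.maf() in place of the original.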

ShixiangWang commented 6 years ago

@PoisonAlien Thanks

xinyueandtianliangle commented 5 years ago

How did this happen? Can the 5th line of the MAF file be removed without opening it? Sorry, I ran into the same problem:

reading maf..
Error in validateMaf(maf = maf, isTCGA = isTCGA, rdup = removeDuplicatedVariants, :
  missing required fields from MAF: Hugo_Symbol

PoisonAlien commented 5 years ago

You could use the sed command for it. If I remember correctly, sed '5d' would remove the 5th line. But wait, could you double-check your input file? Can you share the first 10 or 20 lines of your MAF file?
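
A minimal sketch of the sed approach, assuming a POSIX shell. The usual form of the delete command is `sed '5d'`, which prints every line except the 5th. The file names and the six-line toy file below are hypothetical placeholders, not the real MAF.

```shell
# Hypothetical six-line file standing in for the real MAF.
workdir=$(mktemp -d)
maf="$workdir/toy.maf"
fixed="$workdir/toy.fixed.maf"

printf 'line1\nline2\nline3\nline4\nline5\nline6\n' > "$maf"

# Delete line 5 and write the result to a new file; the original is left untouched.
sed '5d' "$maf" > "$fixed"

# Print the first 20 lines (here, all of them) to inspect or share the result.
head -n 20 "$fixed"
```

Writing to a new file instead of editing in place keeps the original download intact in case a different line turns out to be the problem.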