davidgohel / officer

:cop: officer: office documents from R
https://ardata-fr.github.io/officeverse/

delete tables and figures from docx file using R #143

Closed ghost closed 6 years ago

ghost commented 6 years ago

Does anyone know how to delete all tables and figures from a set of docx files (about 400 files)? I tried the officer package, but it works with keywords and my files have no common pattern. Is there a parameter to target the tables and figures directly, or is there another solution?

davidgohel commented 6 years ago

@aazzaa123 I cannot reproduce any case here. Could you provide more detail about the bug you are facing? Could you please follow the guidelines (I re-copied them below)?

Is it the same as https://stackoverflow.com/questions/51370372/delete-tables-and-figures-from-a-set-of-docx-files-using-r?

If yes, it seems the question is how to extract content from a file, not how to delete content. This subject is documented here: https://davidgohel.github.io/officer/articles/officer_reader.html#import-word-document. You would have to filter elements where content_type %in% "paragraph".

David


davidgohel commented 6 years ago

Yes - I saw it but it is not reproducible :) So it's a question and not an issue.

As already written, the answer is documented here: https://davidgohel.github.io/officer/articles/officer_reader.html#import-word-document.

You don't have to delete anything, you have to use docx_summary and filter with column content_type. You will have to filter elements where content_type %in% "paragraph" and maybe drop paragraphs where the stylename is 'captions' (or whatever stylename you used for caption).
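The filtering described above can be sketched as follows. This is a minimal illustration, not officer's own API: `keep_body_text()` is a hypothetical helper, the caption style name "Table Caption" is an assumption (use whatever style name your documents actually carry), and `toy_summary` is an invented stand-in with the same `content_type`, `style_name` and `text` columns that docx_summary returns.

```r
# Hypothetical helper: keep only body paragraphs from a docx_summary() result,
# dropping table cells and paragraphs whose style is listed as a caption style.
keep_body_text <- function(summary_df, caption_styles = character(0)) {
  is_paragraph <- summary_df$content_type %in% "paragraph"
  # %in% treats NA style names as "not a caption", so unstyled text is kept
  is_caption <- tolower(summary_df$style_name) %in% tolower(caption_styles)
  summary_df$text[is_paragraph & !is_caption]
}

# Invented stand-in for docx_summary(read_docx("some_file.docx"))
toy_summary <- data.frame(
  doc_index = 1:4,
  content_type = c("paragraph", "paragraph", "table cell", "paragraph"),
  style_name = c("Normal", "Table Caption", NA, NA),
  text = c("Intro text.", "Table 1: primers", "Cx31-4F50", "Closing text."),
  stringsAsFactors = FALSE
)

body_text <- keep_body_text(toy_summary, caption_styles = "Table Caption")
print(body_text)
```

With a real file, the call would be `keep_body_text(docx_summary(read_docx(path)), caption_styles = ...)`, where `path` points to your document.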

KR

davidgohel commented 6 years ago

Can you show me these in the results of docx_summary?

davidgohel commented 6 years ago

Sorry, your code is not reproducible, PLEASE follow the guidelines explained in the issue template.

davidgohel commented 6 years ago

Can you add a docx file that has to be imported? (You should be able to drag and drop it into a new comment in this thread; it will be uploaded.)

davidgohel commented 6 years ago

Here is some code:

library(officer)

doc <- read_docx("~/Downloads/issue143/Mitochondrial.DNA.docx")
data <- docx_summary(doc)
# keep only paragraphs (this drops table cells)
data <- data[data$content_type %in% "paragraph", ]
# the extracted text is in data$text

A sample of Mitochondrial.DNA.docx can be seen below:

# sample(data$text, size = 20)
 [1] "paired and control Thai individuals, Clin. Genet. 66 (2004)"                
 [2] "[4] X. Estivill, N. Govera, E. Barcelo, C. Badenas, E. Romeo, L. Moral,"    
 [3] "the restriction endonuclease HaeIII (Amersham Pharmacia Biotech). In"       
 [4] "associated with the mitochondrial tRNASer(UCN) gene, as"                    
 [5] ""                                                                           
 [6] "T7511C mutation in the mitochondrial DNA tRNASer(UCN) gene,"                
 [7] "using standard procedures [23]."                                            
 [8] "Cx31-4F50 GCTCTGCTACCTCATCTGCC 3020224"                                     
 [9] "cycles: 40 s at 94 °C, 50 s at 67 °C–58 °C, and 1 min at 72 °C, and then 35"
[10] "[18] D.P. Kelsell, J. Dunlop, H.P. Stevens, N.J. Lench, J.N. Liang, G."     
[11] "at mitochondrial nucleotides 750 and 1438 were observed in"                 
[12] "Cx30-3R50 AGCAGCAGGTAGCACAACTC 3020"                                        
[13] "Ackah, J. Wu, D.I. Choo, M.X. Guan, Mutational analysis of the"             
[14] "R. Scozzi, L. D’Urbano, M. Zeviani, A. Torroni, Familial progressive"       
[15] "HaeIII digest; Un, undigested PCR product."                                 
[16] "We thank the patients and their families for their coop-"                   
[17] "the most frequent mutation in the GJB2 gene accounting"                     
[18] "Direct sequencing of the GJB3 gene revealed a new poly-"                    
[19] "described by Wattanasirichaigoon et al. [33]."                              
[20] "maternal pattern of inheritance."  
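
Since the original question concerns a set of about 400 files, the same read-and-filter steps could be wrapped in a function and applied to every file in a folder. The following is only a sketch under stated assumptions: `extract_paragraphs()` and the `demo_dir` folder are hypothetical names, and the two throwaway documents are generated here just so the example runs on its own. Point `list.files()` at your real folder instead.

```r
library(officer)

# Hypothetical helper: return the paragraph text of one .docx file
extract_paragraphs <- function(path) {
  content <- docx_summary(read_docx(path))
  content$text[content$content_type %in% "paragraph"]
}

# Build two small throwaway documents (one paragraph and one table each)
demo_dir <- file.path(tempdir(), "docx_batch_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (i in 1:2) {
  doc <- read_docx()
  doc <- body_add_par(doc, paste("Body text of file", i))
  doc <- body_add_table(doc, head(iris, 2))
  print(doc, target = file.path(demo_dir, sprintf("file%d.docx", i)))
}

# Apply the helper to every .docx file in the folder
docx_files <- list.files(demo_dir, pattern = "\\.docx$", full.names = TRUE)
texts <- lapply(docx_files, extract_paragraphs)
names(texts) <- basename(docx_files)
```

`texts` is then a named list with one character vector per file. Table cells are excluded because their content_type is not "paragraph".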

In your SO question, you are asking to delete all the tables and captions. This is not possible in your example document, as it contains neither named styles nor tables. All the content is unformatted, and the tables are not real tables but indented text:

> table(data$content_type, data$style_name)
< table of extent 1 x 0 >
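
The empty 1 x 0 result is what `table()` gives when every style name is NA, because NA values are dropped by default. A minimal sketch on invented data (the `toy` data frame is an assumption standing in for a docx_summary result with no named styles) shows how `useNA = "ifany"` makes those rows visible:

```r
# Invented stand-in: three paragraphs, none with a named style
toy <- data.frame(
  content_type = rep("paragraph", 3),
  style_name = NA_character_,
  stringsAsFactors = FALSE
)

# Default behaviour: NA style names are dropped, leaving no columns
default_tab <- table(toy$content_type, toy$style_name)

# Counting NA as its own category keeps the unstyled paragraphs visible
with_na_tab <- table(toy$content_type, toy$style_name, useNA = "ifany")

print(dim(default_tab))
print(dim(with_na_tab))
```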

davidgohel commented 6 years ago

No, sorry.
