toomanycats commented 7 years ago

I love your software; it's fast and super helpful. I'm using dicomtocsv version 0.7.10.

I'm building a DICOM inventory file for a PostgreSQL database with more than 20e6 rows. Because of the number of rows, I can't wait for the whole run to finish before the output is saved to disk, and would prefer each line to be written as soon as it is completed. That way, if the process fails, I can at least still prototype the rest of the software with what I have.

I usually use Python and Pandas to clean up missing data. I end up adding the full path of each DICOM file to the CSV output, as well as dropping the header. Perhaps I missed an option? Please see the shell scripts below.

So far I'm using dicomtocsv like this:

create_dicom_inventory.sh

#!/bin/bash

trap "exit 1;" SIGINT

tar_dir=$1
output_file=$2

cat header.csv > $output_file
find $tar_dir -type f -name "*.dcm" | parallel -j 10 -n 1 -d'\n' 'dicomtocsv_wrapper.sh {}' >> $output_file

sed -i 's/\r//g' $output_file
exit 0

dicomtocsv_wrapper.sh

input="$1"
row=$(dicomtocsv --silent -q tag_list.txt --image --first-nonzero "$input" | tail -n 1)
printf "\"%s\",%s\n" "$input" "$row"

dgobbi commented 7 years ago

The --noheader option was added in dicomtocsv 0.7.11. Please upgrade if you can.

Have you tried using the "find" command itself to divide the problem into smaller pieces? For example, like this:

find $tar_dir -type f -name "*.dcm" -exec ./dicomtocsv_wrapper.sh {} + >> $output_file

This will send multiple DICOM files to dicomtocsv_wrapper.sh in each call, instead of sending just one file. That should increase the speed: dicomtocsv is really meant to take multiple files as input, so it can't work at full efficiency if you only give it one file.

If you do this, you will also have to modify dicomtocsv_wrapper.sh so that it can take multiple files, e.g. use "${@}" instead of "$1":

dicomtocsv --noheader --silent -q tag_list.txt --image "${@}"
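
The batching behaviour of "-exec ... {} +" can be checked without any DICOM data. The following is a minimal sketch, not part of vtk-dicom: it creates three empty placeholder files in a temporary directory (the names are made up for illustration) and counts how many times find runs the command with "\;" versus "+".

```shell
#!/bin/bash
# Demo: how many times does find run a command with `\;` versus `+`?
# The directory and the *.dcm files are throwaway placeholders, not real DICOM data.
tmp_dir=$(mktemp -d)
touch "$tmp_dir/a.dcm" "$tmp_dir/b.dcm" "$tmp_dir/c.dcm"

# `\;` runs the command once per matching file, so each run sees one argument.
per_file_runs=$(find "$tmp_dir" -type f -name "*.dcm" \
    -exec sh -c 'echo "got $# file(s)"' sh {} \; | wc -l | tr -d ' ')

# `+` appends as many file names as fit onto one command line,
# so a handful of files ends up in a single run.
batched_runs=$(find "$tmp_dir" -type f -name "*.dcm" \
    -exec sh -c 'echo "got $# file(s)"' sh {} + | wc -l | tr -d ' ')

echo "one-per-file invocations: $per_file_runs"
echo "batched invocations: $batched_runs"

rm -rf "$tmp_dir"
```

With millions of files, find splits the list into several batches to stay under the system's command-line length limit, so the wrapper is still called more than once, just far less often than once per file.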

dgobbi commented 7 years ago

Since dicomtocsv currently reads all the files, sorts them, and then writes them out (in that order), what I need to add to make it write out immediately is a "--nosort" option. If it doesn't sort the images, then it can write the record for each image as soon as it finishes reading that image.

toomanycats commented 7 years ago

Thank you for the response and advice. I'll recompile and try out the --noheader option.

A --nosort option would be great. The reason for generating the massive CSV is to load it into a SQL database. Cheers, dpc
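
For the one-file-per-call wrapper, --noheader should make the "tail -n 1" step unnecessary. The following is a minimal sketch under two assumptions: dicomtocsv is version 0.7.11 or later, and a single input file with --image --first-nonzero still yields exactly one data row. The function name inventory_row is hypothetical; the flags are the ones already used in this thread.

```shell
#!/bin/bash
# Hypothetical helper (the name is illustrative): print one CSV row for one
# DICOM file, prefixed with the quoted file path.
# Assumes dicomtocsv >= 0.7.11, where --noheader is available.
inventory_row() {
    local input="$1"
    local row
    # --noheader suppresses the header line, so no `tail -n 1` is needed.
    # Declaring `row` separately keeps the exit status of dicomtocsv visible.
    row=$(dicomtocsv --noheader --silent -q tag_list.txt --image --first-nonzero "$input") || return 1
    printf '"%s",%s\n' "$input" "$row"
}

# When run as a script with a file argument, emit the row for that file.
[ "$#" -eq 0 ] || inventory_row "$1"
```

Returning non-zero when dicomtocsv fails means a row with an empty tag section is not written for a file that could not be read.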

dgobbi / vtk-dicom

dicomtocsv: no header option and file path in output #142
