COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0
780 stars 165 forks source link

Cleanup STDOUT and STDERR messages from salmon-0.10.2 #242

Open mmokrejs opened 6 years ago

mmokrejs commented 6 years ago

Hi, would you mind cleaning up the STDOUT and STDERR output from salmon quant and salmon index? Commonly STDOUT should be used for normal output messages and STDERR for errors or at worst warnings. Salmon writes most of the messages to STDERR.

  1. Version Info: This is the most recent version of Salmon. is output to STDOUT. I find it useless. If salmon run some network connection to figure out its version it is a) prone to errors, b) I would expect a fat warning in the documentation a phone home feature is builtin, c) it is likely to fail on more network-restricted installations. Or, d), the message is incorrect. I suggest drop the message altogether.

  2. salmon quant writes a lot of normal messages to STDERR. Please use STDOUT instead. If a program exits with a non-zero exit code it is common to read its STDERR output to learn what was the cause for the error. It is awkward to realize there is lots of unrelated text. Please follow common rules on Unix.

  3. The docs at http://salmon.readthedocs.io/en/latest/salmon.html did not mention the fmd index is just a plain index from bwa. Why don't you instruct users to use bwa index instead? It would be clearer (if that is the type of index you employ).

  4. salmon index behavior

salmon index -t Homo_sapiens.GRCh38.cdna.all.fa -i Homo_sapiens.GRCh38.cdna.all --type quasi -k 31
index ["Homo_sapiens.GRCh38.cdna.all"] did not previously exist  . . . creating it
[2018-06-25 19:25:57.122] [jLog] [info] building index
RapMap Indexer

[Step 1 of 4] : counting k-mers
[2018-06-25 19:25:57.176] [jointLog] [warning] Entry with header [ENST00000434970.2], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-06-25 19:25:57.176] [jointLog] [warning] Entry with header [ENST00000448914.1], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
...
[2018-06-25 19:26:07.297] [jointLog] [warning] Entry with header [ENST00000579054.1], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-06-25 19:26:07.451] [jointLog] [warning] Entry with header [ENST00000634174.1], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-06-25 19:26:07.459] [jointLog] [warning] Entry with header [ENST00000573437.1], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
Elapsed time: 10.3549s

[2018-06-25 19:26:07.480] [jointLog] [warning] Removed 11768 transcripts that were sequence duplicates of indexed transcripts.
[2018-06-25 19:26:07.480] [jointLog] [warning] If you wish to retain duplicate transcripts, please use the `--keepDuplicates` flag
Replaced 5 non-ATCG nucleotides
Clipped poly-A tails from 1453 transcripts
Building rank-select dictionary and saving to disk done
Elapsed time: 0.0360287s
Writing sequence data to file . . . done
Elapsed time: 1.30042s
[info] Building 32-bit suffix array (length of generalized text is 289267207)
Building suffix array . . . success
saving to disk . . . done
Elapsed time: 2.18092s
done
Elapsed time: 32.3295s
^M^Mprocessed 0 positions^M^Mprocessed 1000000 positions^M^Mprocessed 2000000 positions^M^Mprocessed 3000000 positions^M^Mprocessed 4000000 positions^M^Mprocessed 5000000 positions^M^Mprocessed 6000000 positions^M^Mprocessed 7000000 positions^M^Mprocessed 8000000 positions^M^Mprocessed 9000000 positions^M^Mprocessed 10000000 positions^M^Mprocessed 11000000 positions^M^Mprocessed 12000000 positions^M^Mprocessed 13000000 positions^M^Mprocessed 14000000 positions^M^Mprocessed 15000000 positions^M^Mprocessed 16000000 positions^M^Mprocessed 17000000 positions^M^Mprocessed 18000000 positions^M^Mprocessed 19000000 positions^M^Mprocessed 20000000 positions^M^Mprocessed 21000000 positions^M^Mprocessed 22000000 positions^M^Mprocessed 23000000 positions^M^Mprocessed 24000000 positions^M^Mprocessed 25000000 positions^M^Mprocessed 26000000 positions^M^Mprocessed 27000000 positions^M^Mprocessed 28000000 positions^M^Mprocessed 29000000 positions^M^M^Mprocessed 30000000 positions^M^Mprocessed 31000000 positions^M^Mprocessed 32000000 positions^M^Mprocessed 33000000 positions^M^Mprocessed 34000000 positions^M^Mprocessed 35000000 positions^M^Mprocessed 36000000 positions^M^Mprocessed 37000000 positions^M^Mprocessed 38000000 positions^M^Mprocessed 39000000 positions^M^Mprocessed 40000000 positions^M^Mprocessed 41000000 positions^M^Mprocessed 42000000 positions^M^Mprocessed 43000000 positions^M^Mprocessed 44000000 positions^M^Mprocessed 45000000 positions^M^Mprocessed 46000000 positions^M^Mprocessed 47000000 positions^M^Mprocessed 48000000 positions^M^Mprocessed 49000000 positions^M^Mprocessed 50000000 positions^M^Mprocessed 51000000 positions^M^Mprocessed 52000000 positions^M^Mprocessed 53000000 positions^M^Mprocessed 54000000 positions^M^Mprocessed 55000000 positions^M^Mprocessed 56000000 positions^M^Mprocessed 57000000 positions^M^Mprocessed 58000000 positions^M^Mprocessed 59000000 positions^M^Mprocessed 60000000 positions^M^Mprocessed 61000000 positions^M^Mprocessed 62000000 positions^M^Mprocessed 63000000 positions^M^Mprocessed 64000000 positions^M^Mprocessed 65000000 positions^M^Mprocessed 66000000 positions^M
^Mprocessed 67000000 positions^M^Mprocessed 68000000 positions^M^Mprocessed 69000000 positions^M^Mprocessed 70000000 positions^M^Mprocessed 71000000 positions^M^Mprocessed 72000000 positions^M^Mprocessed 73000000 positions^M^Mprocessed 74000000 positions^M^Mprocessed 75000000 positions^M^Mprocessed 76000000 positions^M^Mprocessed 77000000 positions^M^Mprocessed 78000000 positions^M^Mprocessed 79000000 positions^M^Mprocessed 80000000 positions^M^Mprocessed 81000000 positions^M^Mprocessed 82000000 positions^M^Mprocessed 83000000 positions^M^Mprocessed 84000000 positions^M^Mprocessed 85000000 positions^M^Mprocessed 86000000 positions^M^Mprocessed 87000000 positions^M^Mprocessed 88000000 positions^M^Mprocessed 89000000 positions^M^Mprocessed 90000000 positions^M^Mprocessed 91000000 positions^M^Mprocessed 92000000 positions^M^Mprocessed 93000000 positions^M^Mprocessed 94000000 positions^M^Mprocessed 95000000 positions^M^Mprocessed 96000000 positions^M^Mprocessed 97000000 positions^M^Mprocessed 98000000 positions^M^Mprocessed 99000000 positions^M^Mprocessed 100000000 positions^M^Mprocessed 101000000 positions^M^Mprocessed 102000000 positions^M^Mprocessed 103000000 positions^M^Mprocessed 104000000 positions^M^Mprocessed 105000000 positions^M^Mprocessed 106000000 positions^M^Mprocessed 107000000 positions^M^Mprocessed 108000000 positions^M^Mprocessed 109000000 positions^M^Mprocessed 110000000 positions^M^Mprocessed 111000000 positions^M^Mprocessed 112000000 positions^M^Mprocessed 113000000 positions^M^Mprocessed 114000000 positions^M^Mprocessed 115000000 positions^M^Mprocessed 116000000 positions^M^Mprocessed 117000000 positions^M^Mprocessed 118000000 positions^M^Mprocessed 119000000 positions^M^Mprocessed 120000000 positions^M^Mprocessed 121000000 positions^M^Mprocessed 122000000 positions^M^Mprocessed 123000000 positions^M^Mprocessed 124000000 positions^M^Mprocessed 125000000 positions^M^Mprocessed 126000000 positions^M^Mprocessed 127000000 positions^M^Mprocessed 128000000 positions^M^Mprocessed 129000000 positions^M^Mprocessed 130000000 positions^M^Mprocessed 131000000 positions^M^Mprocessed 132000000 positions^M^Mprocessed 133000000 positions^M^Mprocessed 134000000 positions^M^Mprocessed 135000000 positions^M^Mprocessed 136000000 positions^M^Mprocessed 137000000 positions^M^Mprocessed 138000000 positions^M^Mprocessed 139000000 positions^M^Mprocessed 140000000 positions^M^Mprocessed 141000000 positions^M^Mprocessed 142000000 positions^M^Mprocessed 143000000 positions^M^Mprocessed 144000000 positions^M^Mprocessed 145000000 positions^M^Mprocessed 146000000 positions^M^Mprocessed 147000000 positions^M^Mprocessed 148000000 positions^M^Mprocessed 149000000 positions^M^Mprocessed 150000000 positions^M^Mprocessed 151000000 positions^M^Mprocessed 152000000 positions^M^Mprocessed 153000000 positions^M^Mprocessed 154000000 positions^M^Mprocessed 155000000 positions^M^Mprocessed 156000000 positions^M^Mprocessed 157000000 positions^M^Mprocessed 158000000 positions^M^Mprocessed 159000000 positions^M^Mprocessed 160000000 positions^M^Mprocessed 161000000 positions^M^Mprocessed 162000000 positions^M^Mprocessed 163000000 positions^M^Mprocessed 164000000 positions^M^Mprocessed 165000000 positions^M^Mprocessed 166000000 positions^M^Mprocessed 167000000 positions^M^Mprocessed 168000000 positions^M^Mprocessed 169000000 positions^M^Mprocessed 170000000 positions^M^Mprocessed 171000000 positions^M^Mprocessed 172000000 positions^M^Mprocessed 173000000 positions^M^Mprocessed 174000000 positions^M^Mprocessed 175000000 positions^M^Mprocessed 176000000 positions^M^Mprocessed 177000000 positions^M^Mprocessed 178000000 positions^M^Mprocessed 179000000 positions^M^Mprocessed 180000000 positions^M^Mprocessed 181000000 positions^M^Mprocessed 182000000 positions^M^Mprocessed 183000000 positions^M^Mprocessed 184000000 positions^M^Mprocessed 185000000 positions^M^Mprocessed 186000000 positions^M^Mprocessed 187000000 positions^M^Mprocessed 188000000 positions^M^Mprocessed 189000000 positions^M^Mprocessed 190000000 positions^M^Mprocessed 191000000 positions^M^Mprocessed 192000000 positions^M^Mprocessed 193000000 positions^M^Mprocessed 194000000 positions^M^Mprocessed 195000000 positions^M^Mprocessed 196000000 positions^M^Mprocessed 197000000 positions^M^Mprocessed 198000000 positions^M^Mprocessed 199000000 positions^M^Mprocessed 200000000 positions^M^Mprocessed 201000000 positions^M^Mprocessed 202000000 positions^M^Mprocessed 203000000 positions^M^Mprocessed 204000000 positions^M^Mprocessed 205000000 positions^M^Mprocessed 206000000 positions^M^Mprocessed 207000000 positions^M^Mprocessed 208000000 positions^M^Mprocessed 209000000 positions^M^Mprocessed 210000000 positions^M^Mprocessed 211000000 positions^M^Mprocessed 212000000 positions^M^Mprocessed 213000000 positions^M^Mprocessed 214000000 positions^M^Mprocessed 215000000 positions^M^Mprocessed 216000000 positions^M^Mprocessed 217000000 positions^M^Mprocessed 218000000 positions^M^Mprocessed 219000000 positions^M^Mprocessed 220000000 positions^M^Mprocessed 221000000 positions^M^Mprocessed 222000000 positions^M^Mprocessed 223000000 positions^M^Mprocessed 224000000 positions^M^Mprocessed 225000000 positions^M^Mprocessed 226000000 positions^M^Mprocessed 227000000 positions^M^Mprocessed 228000000 positions^M^Mprocessed 229000000 positions^M^Mprocessed 230000000 positions^M^Mprocessed 231000000 positions^M^Mprocessed 232000000 positions^M^Mprocessed 233000000 positions^M^Mprocessed 234000000 positions^M^Mprocessed 235000000 positions^M^Mprocessed 236000000 positions^M^Mprocessed 237000000 positions^M^Mprocessed 238000000 positions^M^Mprocessed 239000000 positions^M^Mprocessed 240000000 positions^M^Mprocessed 241000000 positions^M^Mprocessed 242000000 positions^M^Mprocessed 243000000 positions^M^Mprocessed 244000000 positions^M^Mprocessed 245000000 positions^M^Mprocessed 246000000 positions^M^Mprocessed 247000000 positions^M^Mprocessed 248000000 positions^M^Mprocessed 249000000 positions^M^Mprocessed 250000000 positions^M^Mprocessed 251000000 positions^M^Mprocessed 252000000 positions^M^Mprocessed 253000000 positions^M^Mprocessed 254000000 positions^M^Mprocessed 255000000 positions^M^Mprocessed 256000000 positions^M^Mprocessed 257000000 positions^M^Mprocessed 258000000 positions^M^Mprocessed 259000000 positions^M^Mprocessed 260000000 positions^M^Mprocessed 261000000 positions^M^Mprocessed 262000000 positions^M^Mprocessed 263000000 positions^M^Mprocessed 264000000 positions^M^Mprocessed 265000000 positions^M^Mprocessed 266000000 positions^M^Mprocessed 267000000 positions^M^Mprocessed 268000000 positions^M^Mprocessed 269000000 positions^M^Mprocessed 270000000 positions^M^Mprocessed 271000000 positions^M^Mprocessed 272000000 positions^M^Mprocessed 273000000 positions^M^Mprocessed 274000000 positions^M^Mprocessed 275000000 positions^M^Mprocessed 276000000 positions^M^Mprocessed 277000000 positions^M^Mprocessed 278000000 positions^M^Mprocessed 279000000 positions^M^Mprocessed 280000000 positions^M^Mprocessed 281000000 positions^M^Mprocessed 282000000 positions^M^Mprocessed 283000000 positions^M^Mprocessed 284000000 positions^M^Mprocessed 285000000 positions^M^Mprocessed 286000000 positions^M^Mprocessed 287000000 positions^M^Mprocessed 288000000 positions^M^Mprocessed 289000000 positions
khash had 109134690 keys
saving hash to disk . . . done
Elapsed time: 12.6069s
[2018-06-25 19:29:02.297] [jLog] [info] done building index

It would be more interesting to see what filename is used for the index. Showing the "basename" of the index file is not very helpful. I would prefer to see the filename(s) on the beginning and end lines:

index ["Homo_sapiens.GRCh38.cdna.all"] did not previously exist  . . . creating it
[2018-06-25 19:25:57.122] [jLog] [info] building index
...
[2018-06-25 19:29:02.297] [jLog] [info] done building index
  1. The duplicates have same sequence or ID or both? The message should be clearer.

I wonder what are these duplicates in a human cdna predicted, as available from ftp://ftp.ensembl.org/pub/release-92/fasta/homo_sapiens/cdna :

[2018-06-25 19:26:07.480] [jointLog] [warning] Removed 11768 transcripts that were sequence duplicates of indexed transcripts.
  1. For the bwa-based index again, the logs are too verbose on one hand and on the second, they do not say what files were created. I doubt any file with prefix Homo_sapiens.GRCh38.cdna.all/bwaidx was created.
+ salmon index -t Homo_sapiens.GRCh38.cdna.all.fa -i Homo_sapiens.GRCh38.cdna.all --type fmd
outputPrefix = "Homo_sapiens.GRCh38.cdna.all/bwaidx"
[2018-06-25 19:29:02.497] [jLog] [info] building index
[bwa_index] Pack FASTA... 2.87 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=609536484, availableWord=54888760
[BWTIncConstructFromPacked] 10 iterations done. 87569268 characters processed.
[BWTIncConstructFromPacked] 20 iterations done. 164630980 characters processed.
[BWTIncConstructFromPacked] 30 iterations done. 233119636 characters processed.
[BWTIncConstructFromPacked] 40 iterations done. 293988548 characters processed.
[BWTIncConstructFromPacked] 50 iterations done. 348084948 characters processed.
[BWTIncConstructFromPacked] 60 iterations done. 396161956 characters processed.
[BWTIncConstructFromPacked] 70 iterations done. 438888868 characters processed.
[BWTIncConstructFromPacked] 80 iterations done. 476860644 characters processed.
[BWTIncConstructFromPacked] 90 iterations done. 510606036 characters processed.
[BWTIncConstructFromPacked] 100 iterations done. 540594980 characters processed.
[BWTIncConstructFromPacked] 110 iterations done. 567245236 characters processed.
[BWTIncConstructFromPacked] 120 iterations done. 590928020 characters processed.
[bwa_index] 279.06 seconds elapse.
[bwa_index] Update BWT... 1.72 sec
[bwa_index] Pack forward-only FASTA... 1.90 sec
[bwa_index] Construct SA from BWT and Occ... 59.56 sec
[2018-06-25 19:34:53.084] [jLog] [info] done building index

Doh, something unexpected from the logs, isn't it?

$ ls -latr ftp.ensembl.org/pub/release-92/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all
total 8374704
drwx------. 3 mmokrejs mmokrejs       4096 Jun 25 19:25 ..
-rw-r--r--. 1 mmokrejs mmokrejs   36158409 Jun 25 19:26 rsd.bin
-rw-r--r--. 1 mmokrejs mmokrejs     423777 Jun 25 19:26 duplicate_clusters.tsv
-rw-r--r--. 1 mmokrejs mmokrejs  294997212 Jun 25 19:26 txpInfo.bin
-rw-r--r--. 1 mmokrejs mmokrejs 1157068836 Jun 25 19:26 sa.bin
-rw-r--r--. 1 mmokrejs mmokrejs 1779709484 Jun 25 19:29 hash.bin
-rw-r--r--. 1 mmokrejs mmokrejs         75 Jun 25 19:29 refInfo.json
-rw-r--r--. 1 mmokrejs mmokrejs       9816 Jun 25 19:29 quasi_index.log
-rw-r--r--. 1 mmokrejs mmokrejs        666 Jun 25 19:29 header.json
-rw-r--r--. 1 mmokrejs mmokrejs  304768324 Jun 25 19:33 bwaidx.bwt
-rw-r--r--. 1 mmokrejs mmokrejs   76192062 Jun 25 19:33 bwaidx.pac
-rw-r--r--. 1 mmokrejs mmokrejs   50007825 Jun 25 19:33 bwaidx.ann
-rw-r--r--. 1 mmokrejs mmokrejs         89 Jun 25 19:33 bwaidx.amb
drwxr-xr-x. 2 mmokrejs mmokrejs       4096 Jun 25 19:34 .
-rw-r--r--. 1 mmokrejs mmokrejs         95 Jun 25 19:34 versionInfo.json
-rw-r--r--. 1 mmokrejs mmokrejs        115 Jun 25 19:34 indexing.log
-rw-r--r--. 1 mmokrejs mmokrejs 4876291928 Jun 25 19:34 bwaidx.sa
$

Hope this helps.

rob-p commented 6 years ago

Hi @mmokrejs,

Thanks again for all the feedback, we'll work on these. I'll mention that (2) is not quite simple as it seems. Specifically, if you write mappings (--writeMappings, -z), the implicit file is STDOUT. I believe this has been discussed with @tseemann in the past. This is intentional so that one can immediately pipe that output to e.g. samtools to convert the SAM format to a BAM format. I am open to cleaner solutions that (ideally) don't involve having to write the BAM files directly.

mmokrejs commented 6 years ago

Hi, thank you for a prompt answer. Well, now I understand more why you write on STDERR ... Hum, I do not know what to do then.

rob-p commented 6 years ago

I think the longer-term solution is for us to directly support BAM writing. This would obviate the need of passing the output through samtools. I will mark this issue as an enhancement to remind us to look into what will be involved in doing that (while we work on addressing the more obvious issues you raise).